General guidelines for conducting research in Oncinfo lab

  1. All Google Docs that need to be edited by lab members should be put in the Oncinfo folder and kept confidential. Send your Gmail address to Habil to get access to this folder. Then create a subfolder with your name there, and create a Google Doc in your subfolder. Copy all items from this “For members” page to that Google Doc, and write “Done”, “Todo”, or “Skip” in front of each item.
  2. Pass the online training courses required by the University, e.g., conflict of interest and safety.
  3. All experiments and analyses are done on Unix, that is, a real Unix-like system such as Linux or OS X, NOT a virtual machine. Start with a tutorial for beginners or the introduction to Bio-Linux.
  4. R is primarily used for statistical analysis and other scripting purposes in Oncinfo Lab. This is a good online course on R, which takes about 1 month to complete. A couple of days should be enough to read this good guide for starters to get the basic ideas, or to cover the introduction section of R-Tutorial. DataCamp lets you read about R and run examples at the same time in a browser. Those who know R to some extent can use the book Bioinformatics with R Cookbook or Advanced R by Hadley Wickham to gradually learn more as they proceed in a project. The next step after learning R is to learn Bioconductor.
  5. Using Emacs as a powerful, general-purpose text editor is encouraged (tutorial). In a terminal, you can start it by typing emacs, even in an SSH session. On Ubuntu, you can install Emacs using Software Center, Synaptic Package Manager, or the following command: sudo apt-get install emacs. On OS X, you can install Emacs For Mac OS X (preferred) or Aquamacs. You can customize Emacs by editing your .emacs file [Habil's].
  6. Using proprietary file formats is not professional when you are sharing information (e.g., your CV) with others. The pdf and png formats are OK and portable. Use Google Docs instead of .docx, and Google Presentation instead of .ppt.
  7. This video illustrates transcription (Wikipedia, video 2). More videos cover gene expression (Wikipedia), translation (detailed), etc.
  8. All members should know about the central dogma of biology, which is almost enough biological knowledge to start the majority of projects [pdf]. Familiarity with some basic concepts such as exon, intron, etc. is helpful. Watch animations from DNA Learning Center.
  9. Any file or data on this wiki that has restricted permissions, such as some paper pdfs or drafts, should not be shared with nonmembers unless authorized by the PI.
  10. All members should read and follow Bill's guidelines, and organize their files and folders accordingly, at least to some extent. Start by making a “~/proj” directory in your home folder that will eventually contain a subfolder for each project you are working on. Major subfolders must have a readme file, for example, to describe where the data come from. Your code folder must include a runall.R script that sources other scripts. Avoid sourcing scripts from any script other than runall.R, because otherwise following and debugging the pipeline becomes difficult.
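    As a rough illustration of the layout described above, the following shell sketch creates a skeleton for one project. The helper name make_project and the subfolder names data, code, and result are assumptions made for this example; they are not prescribed by Bill's guidelines or by the lab.

```shell
#!/bin/sh
# Hypothetical helper: create a skeleton for one project under a root folder.
# The subfolder names (data, code, result) are assumptions for illustration.
make_project() {
    root="$1"   # e.g. "$HOME/proj"
    name="$2"   # project name, without spaces
    mkdir -p "$root/$name/data" "$root/$name/code" "$root/$name/result" || return 1
    # Each major subfolder gets a readme, e.g. to record where the data come from.
    for sub in data code result; do
        [ -f "$root/$name/$sub/readme.txt" ] ||
            echo "Describe the contents of $sub here." > "$root/$name/$sub/readme.txt"
    done
    # runall.R is meant to be the only script that sources other scripts.
    [ -f "$root/$name/code/runall.R" ] ||
        echo '## runall.R: source the other scripts from here, in order.' \
            > "$root/$name/code/runall.R"
}
```

    For example, make_project "$HOME/proj" myProject would create ~/proj/myProject with a readme in each subfolder and a placeholder runall.R under code. Existing files are left untouched.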
  11. Your code and documents should be stored in a Bitbucket repository like https://bitbucket.org/habilzare/genetwork. Sign up for an account and send your username to Habil. Do not forget to add your photo. If you are new to Bitbucket, take Bitbucket 101. With an SSH key, you can avoid manually typing a password each time you pull. To add a key, click on your photo at the top right corner of the Bitbucket page, then Bitbucket settings, SSH keys, Add key. This trick is not appropriate for TACC clusters because we should not change our .ssh folder there. On the cluster, use https to clone instead of ssh. Do NOT skip this step. Before changing anything in a repository, read and abide by the conventions described in the main readme file.
  12. Do NOT use spaces in file or folder names. Do NOT include binary files such as png, pdf, RData, etc. in a Bitbucket repository except on an exceptional basis. Instead, use e.g., rsync -avz -e ssh <username>@ls5.tacc.utexas.edu:<remote path> <local path> or scp to transfer files, and document the exact paths in a readme file in the corresponding folder.
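    A quick way to check a folder against these two rules before committing is sketched below. The function names are hypothetical, and the list of binary extensions only covers the examples named above.

```shell
#!/bin/sh
# List files and folders under a directory whose names contain a space.
# Defaults to the current directory when no argument is given.
find_spaced_names() {
    find "${1:-.}" -name '* *' -print
}

# List png, pdf, and RData files under a directory, i.e. files that
# should normally stay out of the repository.
find_binary_files() {
    find "${1:-.}" -type f \( -name '*.png' -o -name '*.pdf' -o -name '*.RData' \) -print
}
```

    Running find_spaced_names . and find_binary_files . at the top of a working copy should ideally print nothing.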
  13. If you want to use TACC resources, first create an account, and then ask Habil to add you to a project. Look at their user guide or this table of commands for more details. A simple test for running a job on the Stampede cluster is the following:
    $ ssh <username>@stampede.tacc.utexas.edu
    $ cd ~zare
    login4.stampede(1)$ sbatch -p normal -n 1 -t 3 ./test.sh
    We usually use Lonestar5 for computing and Ranch for storage of large data.
  14. Every member should upload a photo to their profile in the wiki. To do this, click on your username at the top right, then Account. In addition, everyone should have a photo and an updated CV in pdf format on their personal page. This is an optional LaTeX template. The permission of the lab notebooks should be set to “hidden”, and it is important that they be updated EVERY day. Write your posts in reverse chronological order so that the newest post comes at the top.
  15. On your wiki account, go to Settings > Email Monitored Changes and set it to “Yes - one email per change”. In this way, we can use the Discussion tab for each page and avoid sending too many emails for updates or for asking questions. A history of the discussions is also saved and remains conveniently available.
  16. On your lab notebook, click on “…” (More options) > Notify, and check “Page discussions”, so that when somebody wants to discuss something with you, you get notified by email.
  17. You can install the Google Scholar Button add-on for an easier way of searching Google Scholar. Select the paper title and then click on the little blue icon at the top right corner. For any paper that you want to cite on the lab wiki, find it on Google Scholar, click on “More>Cite”, and copy the MLA format.
  18. Code style in Oncinfo lab: We follow Hadley Wickham’s R Style Guide unless another convention is mentioned below. The goal is to fit as much code as possible on 1 page so that it is easier to skim, while keeping the overall structure such as proper indentation.
    When writing R code, use “x <- 5” for assigning a value to a variable. Do NOT use “x = 5” or “x<-5”. Do NOT use underscore, “_”, in variable or function names. Instead of “inverse_of”, use “inverseOf” as a variable name so that you can select it by 1 click. Use “inverse.of” as a function name to indicate that it is a function, not a variable. Almost all functions must return a list so that extending them will be easy. Use “##” for comments, NOT a single “#”. Write the name of the loaded object in a comment in front of load(). Avoid long lines of code. Most lines should be < 90 characters, and all lines must be < 120 characters. Thus, do NOT include spaces when using = in function calls. Good example: average <- mean(feet/12+inches, na.rm=TRUE) ## Spaces only around “<-” and after “,”. It is OK not to place a space before the parenthesis after “if(”, “for(”, and the like.
    When a line is long, it usually means you need to extract some of it and define a new variable right above that line.
    Data structures in R can be ordered from simple to complex as follows: number, vector, matrix, and list. Always use the simplest possible data structure, i.e., do not use a list when you can use a matrix.
  19. Never copy code; instead, generalize your code and write functions. If you are copying more than a line of code, you are most likely doing something wrong.
  20. In your code, avoid one-letter variable names such as i or a because they are very hard to track in the editor. Instead, use ind or i1. Also, your variable names must differ from the names of built-in functions such as ls in R.
  21. When possible, give column and row names to matrices and use them. Likewise, give and use names for vectors.
  22. Do’s and don’ts when submitting papers.
  23. Make sure that your home directory and your work directory on the cluster are at least readable by the group. For example, in your .bashrc, set umask 007 and do the following:
    chmod -R g+rwX ~ ; cdw; chmod -R g+rwX . ; cds; chmod -R g+rwX .
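    To see what umask 007 does, the sketch below creates a file and a folder under that umask inside a subshell and prints their permission strings. The function names and the names newFile.txt and newDir are made up for this illustration.

```shell
#!/bin/sh
# Print the 10-character permission string of a path, e.g. "-rw-rw----".
perm_string() {
    ls -ld "$1" | cut -c1-10
}

# Create a file and a folder inside directory $1 under umask 007.
# The subshell keeps the caller's own umask and working directory unchanged.
demo_umask() {
    (
        umask 007
        cd "$1" || exit 1
        : > newFile.txt      # new files start from mode 666, minus the umask bits
        mkdir newDir         # new folders start from mode 777, minus the umask bits
        perm_string newFile.txt
        perm_string newDir
    )
}
```

    In a fresh temporary folder, demo_umask typically reports -rw-rw---- for the file and drwxrwx--- for the folder: the owner and the group keep full access, and others get none.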
  24. If you are unfamiliar with prior, posterior, and likelihood, read about Bayesian inference.
  25. To use the refs.bib bibliography in bibtex, do the following:
    a) cd ~/proj
    b) git clone git@bitbucket.org:habilzare/refs.git
    c) At the bottom of your LaTeX document, write:
    \bibliography{\detokenize{~/proj/refs/refs}}
    d) To add a new entry, find the appropriate format using “Google Scholar Button” (see above; click on the quotation mark at the top right, and then BibTeX at the bottom), copy the entry, and check whether it is already in the refs.bib file. If not, add it in “its right location” (i.e., keys are alphabetically ordered) and push. Use the key with the \cite command in your LaTeX file. To compile, run pdflatex, then bibtex (giving the file name without .tex), and then pdflatex twice.
  26. To ssh into the lab machine, you first need an account on that machine. Then use the following command: ssh <username>@10.102.163.212
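    Because the keys in refs.bib are kept in alphabetical order, a small check before pushing can catch a misplaced entry. The sketch below is only an approximation: the function names are hypothetical, it assumes each entry starts on its own line in the form @type{key, and it treats the ordering as case-insensitive, which is an assumption.

```shell
#!/bin/sh
# Print the citation keys of a BibTeX file in the order they appear.
# Assumes every entry starts on its own line as "@type{key,".
bib_keys() {
    sed -n 's/^[[:space:]]*@[A-Za-z]*{[[:space:]]*\([^,[:space:]]*\).*/\1/p' "$1"
}

# Succeed only if the keys are already in alphabetical order.
# "sort -c" checks the order without sorting; "-f" ignores case (an assumption).
bib_keys_sorted() {
    bib_keys "$1" | LC_ALL=C sort -f -c
}
```

    For example, bib_keys_sorted ~/proj/refs/refs.bib && echo "keys are ordered" prints the message only when no entry is out of place; otherwise sort reports the first key that breaks the order.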
  27. Usually, the latest version of Ubuntu is missing some dependencies for the R packages Pigengene, WGCNA, and GEOquery. To easily install these dependencies, download this script, as well as pigengeneUbuntuInstall.R to a local directory. Navigate to that directory and then run:
    $ chmod +x ubuntuInstall.sh
    $ ./ubuntuInstall.sh
    $ Rscript pigengeneUbuntuInstall.R
  28. Please cc Habil on any email that is related to scientific or logistic aspects of your research in the lab, your career development activities, and communications among lab members on issues related to the lab. When you send an email to multiple people, mention the primary addressee at the top. It helps draw the attention of the addressee, and also shows respect for others who do not need to read your whole message.
  29. As employees of UT Health, we can get facilitated appointments with UT Health primary care physicians (call: 210-450-9090).

Some references

  1. Two machine learning bibles: Bishop (1, 2) and Hastie et al.
  2. Biostars is a good forum, similar to Stack Overflow in structure, but focused on bioinformatics and Computational Biology.
  3. List of free online bioinformatics courses and some interesting events in Bioconductor.
  4. This is a good online course on Probabilistic Graphical Models by Daphne Koller.
  5. Bayesian Reasoning and Machine Learning, a good introductory book by David Barber.
  6. List of bioinformatics workshops.
  7. A 5-minute video introduction to next-generation sequencing.

Fun stuff

  1. Inner Life Of A Cell.
  2. The Dark Age of the Universe, a good visualization of the big bang.
  3. Pigeons can learn to diagnose breast cancer with 99% accuracy.
  4. CRISPR is a revolutionary tool for editing DNA, which reduces the time and cost of genome modification by an order of magnitude.