Prepared by
Damian Counsell,
Institute of Cancer Research, UK
General
What is Bioinformatics?
Roughly, bioinformatics describes any use of
computers to handle biological information. In practice the
definition used by most people is narrower; bioinformatics to them
is a synonym for "computational molecular biology"--- the use of
computers to characterise the molecular components of living things.
The Tight definition
"Classical" bioinformatics
Fredj Tekaia at the Institut Pasteur offers
this definition of bioinformatics:
"The mathematical, statistical and
computing methods that aim to solve biological problems using DNA
and amino acid sequences and related information."
I would say most biologists talk about
"doing bioinformatics" when they use computers to store,
retrieve, analyse or predict the
composition or the structure of biomolecules.
These molecules include your genetic material---nucleic acids---and
the products of your genes: proteins. These activities are
what I would refer to as "classical" bioinformatics, dealing
primarily with sequence analysis.
It is a mathematically interesting property
of most large biological molecules that they are polymers;
ordered chains of simpler molecular modules called
monomers. Think of them as beads or building blocks
which, despite having different colours and shapes, all have the
same thickness and the same way of connecting to one another. Each
monomer molecule is of the same general class, but each kind of
monomer has its own well-defined set of characteristics. Many
monomer molecules can be joined together to form a single, far
larger, macromolecule which has exquisitely
specific informational content and/or chemical properties.
According to this scheme, the monomers in a
given macromolecule of DNA or protein can be treated computationally
as letters of an alphabet, put together in
pre-programmed arrangements to carry messages or do work in a cell.
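To make the "letters of an alphabet" idea concrete, here is a minimal Python sketch. It treats a DNA molecule as nothing more than a string over the four-letter alphabet A, C, G and T. The function names and the example fragment are made up for illustration; they do not come from any particular bioinformatics package.

```python
from collections import Counter

# The four DNA monomers (nucleotides), written as single letters.
DNA_ALPHABET = set("ACGT")


def composition(sequence):
    """Count how often each letter (monomer) occurs in a sequence."""
    return dict(Counter(sequence.upper()))


def is_dna(sequence):
    """True if every letter of the sequence belongs to the DNA alphabet."""
    return set(sequence.upper()) <= DNA_ALPHABET


# A made-up fragment, used only as an illustration.
fragment = "ATGGCGTACGCTTAA"
print(composition(fragment))
print(is_dna(fragment))
```

Once a molecule is a string, the ordinary tools of text processing (searching, counting, comparing) can be applied to it, which is much of what classical sequence analysis does.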
"New" bioinformatics
The greatest achievement of bioinformatics
methods, the Human Genome
Project, is currently being completed. Because of this the
nature and priorities of bioinformatics research and applications
are changing. People often talk portentously of our living in the "post-genomic"
era. My personal view is that this will affect bioinformatics in
several ways:
- Now we possess multiple whole genomes we
can look for differences and similarities between all the genes of
multiple species. From such studies we can draw particular
conclusions about species and general ones about evolution. This
kind of science is often referred to as comparative
genomics.
- There are now technologies designed to
measure the relative number of copies of a genetic message (levels
of gene expression) at different stages in development or disease
or in different tissues. Such technologies, for example DNA microarrays, will grow in
importance.
- Other, more direct, large-scale ways of
identifying gene functions and associations (for example
yeast two-hybrid methods) will grow in significance and with
them the accompanying bioinformatics of functional
genomics.
- There will be a general shift in emphasis
(of sequence analysis especially) from genes themselves to gene
products. This will lead to:
- attempts to catalogue the activities
and characterize interactions between all gene products (in
humans):
proteomics.
- attempts to crystallize and/or predict
the structures of all proteins (in humans): structural
genomics.
- fewer DNA double-helices in bad sci-fi
movies.
- What some people refer to as
research or
medical informatics, the management of all
biomedical experimental data associated with particular
molecules---from mass spectroscopy, to in vitro assays to
clinical side-effects---will move from the concern of those
working in drug company and hospital I.T. (information technology)
into the mainstream of cell and molecular biology and migrate from
the commercial and clinical to academic sectors.
This FAQ concentrates on classical bioinformatics, but
will, I hope, grow to cover more of the "post-genomic" aspects of the field. It
is worth noting that all of the above non-classical areas of research depend
upon established sequence analysis techniques.
Overview of most common bioinformatics
programs
Everyday bioinformatics is done with
sequence search programs like BLAST,
sequence analysis programs like the EMBOSS and Staden packages,
structure prediction programs like THREADER
or PHD
or molecular imaging/modelling programs like RasMol and WHATIF.
Overview of most common bioinformatics
technology
Currently, a lot of bioinformatics work is
concerned with the technology of databases. These databases include
both "public" repositories of gene data like GenBank
or the Protein DataBank (the
PDB), and private databases like those used by research groups
involved in gene mapping projects or those held by biotech
companies. Making such databases accessible via open standards like
the Web is very important since consumers of bioinformatics data use
a range of computer platforms: from the more powerful and forbidding
UNIX boxes favoured by the developers and curators to the far
friendlier Macs often found populating the labs of computer-wary
biologists.
Databases of existing sequencing data can be
used to identify homologues of new molecules that have been
amplified and sequenced in the lab. The property of sharing a common
ancestor,
homology, can be a very powerful indicator in
bioinformatics (see below).
Acquisition of sequence data
Bioinformatics tools can be used to obtain
sequences of genes or proteins of interest, either from material
obtained, labelled, prepared and examined in electric fields by
individual researchers/groups or from repositories of sequences from
previously investigated material.
Analysis of data
Both types of sequence can then be analysed
in many ways with bioinformatics tools.
They can be assembled. Note that
this is one of the occasions when the meaning of a biological term
differs markedly from a computational one (see the amusing confusion
over the issue at Web-based geek forum Slashdot). Computer scientists,
banish from your mind any thought of assembly language. Sequencing
can only be performed for relatively short stretches of a
biomolecule and finished sequences are therefore prepared by
arranging overlapping "reads" of monomers (single beads on
a molecular chain) into a single continuous passage of "code".
This is the bioinformatic sense of assembly.
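The bioinformatic sense of assembly can be sketched in a few lines of Python. This is a toy greedy approach under simplifying assumptions: the reads are error-free, all come from the same strand, and overlaps of at least `min_length` letters are trusted. Real assemblers are far more sophisticated. The function names and the example reads are invented for illustration.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left overlaps; return the separate pieces
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up overlapping reads from one short stretch of DNA.
reads = ["ATGGCGT", "GCGTACG", "TACGCTTAA"]
print(greedy_assemble(reads))
```

The greedy choice (always merge the best-overlapping pair first) keeps the sketch short, but it can be fooled by repeated regions, which is one reason real assembly is hard.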
They can be mapped (see note)---that
is, their sequences can be parsed to find sites where so-called
"restriction enzymes" will cut them.
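Mapping in this sense amounts to a string search. The Python sketch below looks for the recognition sequences of three well-known restriction enzymes (EcoRI: GAATTC, BamHI: GGATCC, HindIII: AAGCTT) in a made-up piece of DNA. It reports only where each recognition site starts, not the exact position of the cut within the site, and the function name is hypothetical.

```python
# Recognition sequences of three commonly used restriction enzymes.
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)  # allow overlapping hits
    return positions


# A made-up sequence, used only as an illustration.
dna = "TTGAATTCAAGGATCCTTGAATTCA"
for enzyme, site in RESTRICTION_SITES.items():
    print(enzyme, find_sites(dna, site))
```

All three sites used here read the same on both strands of the double helix, so searching one strand is enough; for an enzyme with a non-palindromic site you would also have to search the complementary strand.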
They can be compared, usually by
aligning corresponding segments and looking for matching and
mismatching letters in their sequences. Genes or proteins which are
sufficiently similar are likely to be related and are therefore said
to be "homologous" to each other---the whole truth is rather more
complicated than this. Such cousins are called "homologues".
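As a rough illustration of what "aligning" means, here is a small Python sketch of global alignment by dynamic programming (the approach usually credited to Needleman and Wunsch). The scoring values (+1 for a match, -1 for a mismatch, -1 for a gap) are arbitrary choices for the example, not the scores used by any real program, and the function name is invented.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Align two sequences end to end; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for putting letter a[i-1] opposite letter b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # letter against letter
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


top, bottom, total = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print("score:", total)
```

A high score says only that two sequences are similar. Whether they are homologous, that is, whether they share an ancestor, is an inference drawn from that similarity, not something the program measures.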
If a homologue (a related molecule) exists
then a newly discovered protein may be modelled---that is, the
three-dimensional structure of the gene product can be predicted without
doing laboratory experiments.
Bioinformatics is used in primer design.
Primers are short sequences needed to make many copies of (amplify)
a piece of DNA as used in PCR (the Polymerase
Chain Reaction).
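Two of the simplest calculations behind primer design can be sketched in Python: taking the reverse complement of a sequence, and estimating a melting temperature with the rule of thumb often called the Wallace rule (2 degrees C for each A or T, 4 degrees C for each G or C). The rule is only a crude estimate for short primers, real primer design programs take many more factors into account, and the candidate primer below is invented for the example.

```python
# Translation table pairing each base with its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the opposite strand, read in the conventional direction."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(sequence):
    """Fraction of the letters that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def wallace_tm(sequence):
    """Rough melting temperature in degrees C (rule of thumb for short primers)."""
    sequence = sequence.upper()
    at = sequence.count("A") + sequence.count("T")
    gc = sequence.count("G") + sequence.count("C")
    return 2 * at + 4 * gc


# A made-up 16-letter candidate primer.
primer = "ATGCATGCATGCATGC"
print(reverse_complement(primer))
print(gc_fraction(primer))
print(wallace_tm(primer))
```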
Bioinformatics is used to attempt to
predict the function of actual gene products.
Information about the similarity, and, by
implication, the relatedness of proteins is used to trace the
"family trees" of different molecules through evolutionary
time.
There are various other applications of
computer analysis to sequence data, but, with so much raw data being
generated by the Human Genome Project and other initiatives in
biology, computers are presently essential for many biologists just
to manage their day-to-day results.
Molecular modelling / structural biology is
a growing field which can be considered part of bioinformatics.
There are, for example, tools which allow you (often via the Net) to
make pretty good
predictions of the secondary structure of proteins arising
from a given amino acid sequence, often based on known "solved"
structures and other sequenced molecules acquired by structural
biologists.
Structural biologists use "bioinformatics"
to handle the vast and complex data from X-ray crystallography,
nuclear magnetic resonance (NMR) and electron microscopy
investigations and create the 3-D models of molecules that seem to
be everywhere in the media.
note
Unfortunately the word "map"
is used in several different ways in
biology/genetics/bioinformatics. The definition given above is the
one most frequently used in this context, but a gene can be said to
be "mapped" when its parent chromosome has been identified, when its
physical or genetic distance from other genes is established
and---less frequently---when the structure and locations of its
various coding components (its "exons") are established.
The Loose definition
There are other fields---for example medical
imaging / image analysis which might be considered part of
bioinformatics. There is also a whole other discipline of
biologically-inspired computation;
genetic algorithms, AI, neural networks. Often these areas
interact in strange ways. Neural networks, inspired by crude models
of the functioning of nerve cells in the brain, are used in a
program called PHD to predict, surprisingly accurately, the
secondary structures of proteins from their primary sequences.
What almost all bioinformatics has in common
is the processing of large amounts of biologically-derived
information, whether DNA sequences or breast X-rays.
How old is the
discipline?
"How old is bioinformatics?" The answer to
this one depends on which source you choose to read.
From T K Attwood and D J Parry-Smith's
"Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher
Education; ISBN 0582327881]:
"The term bioinformatics is used to
encompass almost all computer applications in biological sciences,
but was originally coined in the mid-1980s for the analysis of
biological sequence data."
From Mark S. Boguski's article in the
"Trends Guide to Bioinformatics" Elsevier, Trends Supplement 1998
p1:
"The term `bioinformatics' is a relatively
recent invention, not appearing in the literature until 1991 and
then only in the context of the emergence of electronic
publishing...
"...However, some of my role models when I
was a graduate student (Margaret O. Dayhoff, Russell F. Doolittle,
Walter M. Fitch and Andrew D. McLachlan) had been building
databases, developing algorithms and making biological discoveries
by sequence analysis since the 1960s---long before anyone thought
to label this activity with a special term (if anything it was
called `molecular evolution'). Even a relatively new kid on the
block, the National Center for Biotechnology Information (NCBI),
is celebrating its 10th anniversary this year, having been written
into existence by US Congressman Claude Pepper and President
Ronald Reagan in 1988. So bioinformatics has, in fact, been in
existence for more than 30 years and is now middle-aged."
Resources
Can you
recommend any bioinformatics books?
It's notoriously difficult to find any books
on bioinformatics itself that cater well for all of those coming
from computing, from mathematics and from biology backgrounds. The
few textbooks available in the field tend to be eyewateringly
expensive as well. I've divided suggested reading into books of general
interest, those
best suited to people coming from a computational/mathematical
background and books for
biologists interested in bioinformatics. After my suggestions
are some links to other lists of bioinformatics books.
General introductions
Many people are curious about the Human
Genome (Project). The completion of the first draft probably
represents bioinformatics' coming of age as a discipline. The first
couple of books are aimed at the intelligent layperson.
A gossipy and insightful account of the race
to sequence the genome can be found in "The Sequence"
by Kevin Davies [Weidenfeld; ISBN 0297646982]. Matt Ridley's "Genome"
[Fourth Estate; ISBN 185702835X] is both an interesting layperson's
introduction to the issues raised by the bioinformatic revolution
and an overview of its biology and enormous scope. If you are a
non-biological scientist (or a non-scientist) and are hooked by
these, why not go back to the "real beginning" of the race and read
James Watson's entertaining and indiscreet memoir of his and Francis
Crick's determination of the structure of DNA, "The Double
Helix" [Penguin; ISBN 0140268774]---now updated with an
introduction by media don Steve Jones.
Nigel Barber at Peterborough Regional
College in the UK recommends Gary Zweiger's "Transducing the Genome"
[McGraw-Hill Professional Publishing: ISBN 0071369805]. The summary
at Amazon makes it sound a tad pretentious, but all the reviews seem
pretty positive so it might be worth a read.
Computational/Mathematical aspects
If you are a hardcore maths/computing person
Michael Waterman's
"Introduction to Computational Biology" [Chapman &
Hall/CRC Statistics and Mathematics; ISBN 0412993910] and Pavel
Pevzner's "Computational Molecular Biology - An Algorithmic
Approach" [The MIT Press (A Bradford Book); ISBN 0262161974]
will give you all the discrete maths you can shake a stick at, but
perfunctory introductions to the biology.
Bioinformatics.org's very own Jeff Bizzarro
recommends Dan Gusfield's "Algorithms on Strings, Trees and
Sequences" [Cambridge, 1997 ISBN 0-52158-519-8], Richard
Durbin, S. Eddy, A. Krogh, G. Mitchison "Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids"
[Cambridge, 1997 ISBN 0-52162-971-3] (which I think is one of the
clearest and most comprehensive guides to alignment algorithms)
and---for that full "computers-to-biology conversion"--- Geoffrey M.
Cooper "The Cell: A Molecular Approach" [ASM Press,
1996 ISBN 0-87893-119-8]. Jeff Ames writes that a second edition of
this book is now available [Sinauer Associates, Incorporated, 2000
ISBN 0-87893-106-6] and that this version---if you can find it in
the shops---comes with a CD.
Applying bioinformatics to biological research
If you're coming to the subject as a
computer user with a biological background, looking to exploit the
many tools available, you might want to try Terry Attwood and David
Parry-Smith's
"Introduction to Bioinformatics" [Longman Higher
Education; ISBN 0582327881], or Des Higgins and Willie Taylor's
"Bioinformatics: Sequence Structure and Databanks"
[Oxford University Press; ISBN 0199637903]. Bioinformatics.org also
recommends Cynthia Gibas and Per Jambeck's "Developing
Bioinformatics Skills" [O'Reilly, 2001 ISBN 1-56592-664-1].
Further suggestions for this section are
welcome.
Other
lists of bioinformatics books
See also
compbiology.org's
list and Steve
Brenner's
list.
What
bioinformatics sites are there?
Tutorials
A great place to start, whether you come
from a biological, physical or computational background is at Martin
Vingron's superb online bioinformatics tutorial. (Begin by
choosing a section from the left-hand-side menu bar.)
I recently stumbled upon a promising set of
online lecture notes currently under construction by B. Steipe
at the
Genzentrum (Gene Center)
at the Ludwig-Maximilians-Universität
München (University of Munich).
Chemistry for all
A defiantly frames-free chemistry
tutorial site.
Mathematics for biologists
First of all, an almost completely
painless introduction to the horrors of the quadratic equation
by Peter Whalen, James Walker, and Drew Marticorena.
C. J. Schwarz of the
Department of Statistics and Actuarial Science,
Simon Fraser University has produced a course in "Statistics for
the Life Sciences" which is accompanied by a set of sound,
online html handouts. They aren't the prettiest, but they're some
of the best. (Though his "paradigm of statistics" mnemonic "TRRGET"
is completely inconsistent with his explanation of what the letters
stand for... If anyone can
enlighten me I'd be pleased to know what I'm failing to
understand.)
Here is a great guide
to a whole array of statistical learning/teaching resources prepared
by Juha Puranen of
the University of Helsinki (English).
Computers for biologists
Programming for biologists
General introduction to biology for
computer scientists
Estrella Mountain Community College in the States offers this
excellent
short introduction to biology (actually "The Nature of Science
and Biology"). It's a great place for keyboard jockeys to start their
journey to enlightenment.
Molecular biology for computer scientists
The Institute of Arable Crop Research
Beginner's Guide to Molecular Biology
Protein chemistry for computer
scientists
Unilever Education Advanced Series
tutorial on proteins.
Cell biology for computer scientists
The
University of Arizona has made available a high-quality tutorial
in cell biology. Not only does it cover the facts, but it also
attempts to introduce some of the philosophy of the
field---recommended. Even better, it's also available en Español.
Once you've worked your way through that you
might like to see some scanning electron microscope images of
some of the structures you've read about taken by members of John Heuser's
lab.
Evolution for computer scientists
Bob Patterson maintains his "Darwiniana" with
amazing diligence.
Practical bioinformatics
Other lists of bioinformatics
tutorials
Societies
Humberto Ortiz Zuazaga kindly introduced me
to The International Society for Computational
Biology which he points out "has links to programs of study and
online courses in computational biology and to job postings".
Collections of Tools
I cannot recommend strongly enough the Human
Genome Mapping Project Resource Centre's "GenomeWeb".
Of historical interest only now, I guess, is
the legendary "Pedro's
Molecular Biology Search and Analysis Tools".
Portals
CCP11 (Collaborative Computational Project
11) is another great product of the UK's Genome Campus. To quote
their Web site, it was...
"...established to foster the broad
bioinformatics community and the UK research community in
particular. Its purpose is to facilitate the transfer of knowledge
and expertise through conferences, workshops, a newsletter and the
use of the world wide web. CCP11 is funded by the BBSRC and is hosted at the MRC
Human Genome Mapping Project Resource Centre HGMP-RC located on the
Wellcome Trust Genome Campus,
Cambridge."
Jennifer Steinbachs runs compbiology.org which is a
general computational biology site as well as being a portal to her
own work.
BioPlanet is well worth visiting, though I have to say I have no
idea who runs it or what its precise status (commercial, personal,
for-fun) as a Web site is.
Careers
How can I
get involved?
If you want to get involved in
bioinformatics, now is an exciting time. I can honestly say this is
one area of science where demand for skilled practitioners (and
salaries) can be very high.
This section is opinionated, partly because
there are people in the field, both computer scientists and
biologists, who I would love to provoke (or convert). If you are a
newcomer, and especially if you come from one of bioinformatics'
component pure disciplines, I hope my ranted warnings will help you
to avoid the mistakes of your predecessors---and I write as one of
the mistaken.
David S. Roos put it well in his recent
review in the journal
Science:
"Lack of familiarity with the intellectual
questions that motivate each side can also lead to
misunderstandings. For example, writing a computer program that
assembles overlapping expressed sequence tags (EST) sequences may
be of great importance to the biologist without breaking any new
ground in computer science. Similarly, proving that it is
impossible to determine a globally optimal phylogenetic tree under
certain conditions may constitute a significant finding in
computer science, while being of little practical use to the
biologist."
How can I get involved?---I am a "newbie"
If you are a high school student / sixth
former, think about taking an interdisciplinary computational
biology or bioinformatics bachelor's degree of the sort offered at,
for example, Manchester University in the UK or UPenn in the States.
Don't worry if you can't find a place on such a course or there
isn't one nearby; perhaps the best way to approach this subject is
from two sides. Do a bachelor's degree in one area while taking a
healthy interest in the other---or (if you can afford to) complement
a first degree in one part of the discipline with a second degree in
the other.
If you already have a degree in a biological
discipline there are similar Master's courses---both
interdisciplinary (e.g. Birkbeck's in London) and conversion
type courses---for biologists or others to learn computer science,
for example.
If you are currently doing a computer
science or biology PhD, try to take advantage of the opportunity to
take courses in the "other" discipline.
How can I get involved?---I am a biologist
To a biologist I would say: take as many
real computing courses as you can. It's important not just to
learn a programming language, but also to learn the discipline
of computing; to structure and document your work in a rigorous way.
What courses you take might be directed by the kind of work you are
interested in doing when you graduate---whether you see yourself
supporting bioinformatics applications or building them. For the
former you need all-round familiarity with the programs themselves
and the hardware and software needed to run them---plus your
existing understanding of biology. For the latter you need to learn
a structured programming language and the principles of good program
design---plus the ability to talk to and understand biologists.
Courses biologists might consider taking:
- UNIX
-
Of all the computing courses available it
is most important that you have a proper introduction to the UNIX
operating system. Most current bioinformatics software (especially
the free stuff) runs on "open" platforms like UNIX and the Web.
UNIX is elegant, powerful and frustrating. Master it and you will
save a lot of time.
- Mathematics
-
Learn some maths. Basic statistics,
logic/set theory and a little calculus would be my recommendation.
Many practising biologists have little or no grasp of elementary
concepts like statistical significance, permutations and
combinations and the principles of good experimental design. Logic
will come in handy at the very least if you want to query
databases in an intelligent way.
- Programming
-
If you're interested in development, learn
a real programming language: Pascal, C(++), Java or Fortran.
Perl and HTML are the stuff that holds the
Web together. A grasp of these is essential for a lot of the
Web/database work being done by many bioinformaticists at the
moment.
Good old BASIC can be very useful as an
introduction to programming or as a tool in its own right, but
none of these latter languages is built to crunch numbers and
tackle real world biological problems---which isn't to say people
don't try...
How can I get involved?---I am a
computational/quantitative scientist
One thing that I will emphasise repeatedly
in this section is the simple value of doing some "proper"
biological laboratory science. I have sat through talk after talk
where a bioinformatics "scientist" describes in great detail how his
(it's usually "his") whizzy new application of a trendy mathematical
tool offers a supposed insight into a (sometimes supposed)
biological problem. Nine times out of ten I know that this
"solution" will never be so much as sneezed on by a practising
biologist.
Quantitative scientists talk about their
interest in studying some aspect of "God's mind". Biologists are
interested in "Mother Nature's body". If you want to win Nature over
you are going to have to meet her in the flesh. You are as likely to
be useful to biologists working in isolation at the keyboard as you
are to conceive with your clothes on. Desk-bound bioinformaticists
have written code that has turned out to be popular with
biologists, but almost always because they have collaborated with
biologists.
Courses quantitative scientists might
consider taking:
- Molecular biology
-
"MoBi" was the bioinformatics of its day;
desperately fashionable, the province of new, higher-paid
practitioners and considered with slight suspicion by more
traditional biologists. It was once a great achievement to
sequence a modest stretch of DNA, now it's a job for robots. Today
the technology is very well established. Scientists can buy
molecular biology kits to perform the sort of genetic
manipulations that would make your parents' jaws drop. Some of the
kits are so simple your
parents' parents could use them (with a modest amount of
training and supervision).
Despite the profusion of commercial kits,
there is still a requirement for real skill in molecular biology
and the general level of scientific understanding required to be a
good biological scientist---rather than just completing a
practical class---doesn't come easy. Living matter, the stuff you
have to work with, is unpredictable and responds slowly---except
when it's dying. Even supposedly fast-growing bacteria can take a
long time to yield up their secrets.
Even now, as the focus of biomedical
research shifts from molecular biology back to cell biology and
protein biochemistry, it's well worth offering yourself up as a
volunteer for some vacation work in a molecular biology lab. The
term is now more often used to refer to the technological tools it
provides biology in general rather than to fundamental research in
the field itself. Those tools are common to a vast array of
different kinds of research, from archaeology to zoology.
- Protein (bio)chemistry
-
Protein (bio)chemistry is experiencing a
revival. Proteins are still more delicate and fussy than nucleic
acids. The same advice that applies to molecular biology applies
to protein biochemistry. That stuff bioinformatics people refer to
as "wet lab science" is much harder than it looks.
You might find it more difficult to get
access to a good protein lab than a good molecular biology lab and
do protein science with real wizards, but the very least you can
do is read about the theoretical aspects of the subject.
For insights into the principles of
protein structure, try, for example, Carl Branden and John
Tooze's "Introduction to Protein Structure" [Garland ISBN
0-8153-2305-0]. Physicists in particular might find the lack of
general unifying principles in this area overwhelming.
Unfortunately there's no substitute for acquiring a "feel" for
the subject by examining a lot of examples. Still the most
critical stages in the successful prediction of protein structure
from sequence are those requiring human intervention.
Thomas E. Creighton has been responsible
for a range of standard texts on protein chemistry. If you are
working in a protein lab you are likely to come across his
"Protein Function : A Practical Approach" [ISBN 019963615X] and
the rather more expensive and theoretical "Proteins : Structures
and Molecular Properties" [ISBN 071677030X].
- Evolutionary biology
-
It's a worn quote, but worth repeating:
"The mechanisms that bring evolution
about certainly need study and clarification. There are no
alternatives to evolution as history that can withstand critical
examination. Yet we are constantly learning new and important
facts about evolutionary mechanisms. Nothing in biology
makes sense except in the light of evolution."
Theodosius Dobzhansky in "American
Biology Teacher" vol.35
Darwin's theory is one of the simplest and
most misunderstood in science. Start with a good layperson's
introduction, Richard Dawkins's "The Selfish Gene" (and remember:
it's a
metaphor, stupid) or Steve Jones's paraphrasing of
Darwin's original "On the Origin of Species", "Almost Like a
Whale". All biologists agree on the underlying principles, but
they are nearly ready to kill one another over the details. After
reading a decent book on evolutionary biology you should have at
least a handful of good questions. Now you are ready to take a
class in the subject. Take your questions with you. You'll
probably start an argument---or a fight.
You might also like to peruse Cynthia
Gibas's answers
to similar questions from computational scientists on the O'Reilly Web site.
These damned biologists are making me use
Word instead of LaTeX to write up---what can I do?
Try
this.
More general advice
Use the software
Get access to an installation of EMBOSS
and/or Staden and get someone to lead you through the tools
available. RasMol is a simple,
but powerful and elegant molecular imaging program which can teach
you a great deal about biological macromolecules; try a tutorial.
Get out on the Web and do some productive surfing for a
change :-) . The best starting point is the Human Genome Mapping
Project Resource Centre's "GenomeWeb". There's
so much stuff out there -- and most of it is free to
academics.
Where can I
study Bioinformatics?
I am gradually building this section up. Its
focus is on complete, full-time degree programmes rather than on
individual study modules. Curating a list of the latter would be a
full-time job. You can go to other places, however, if you are
looking for short courses. Thanks to various
contributors, including Wentian Li who pointed me to this list at
Rockefeller which is mirrored at various other sites. And to
Humberto Ortiz Zuazaga for mailing me a link to the ISCB, where you
can find this list. In
the UK the wonderful CCP11 project maintains
(among many other resources) lists of (mainly) British Masters
and PhDs
in bioinformatics. If you have any suggestions or updates please
contact me with them. You can publicize your course and offer a
public service at the same time.
Africa
South African National Bioinformatics
Institute (SANBI) Honours
Bioinformatics Course at the University of the Western Cape.
Next year the same institute will be offering a Master's in
bioinformatics---thanks to Cathal Seoighe.
If you know of any other bioinformatics
courses on the African continent please feel free to
mail me about them.
The Americas
Canada
The
University of Waterloo,
Department of Computer Science offers
undergraduate and
graduate courses in bioinformatics.
California
In apparent contradiction to the URL,
the Keck Graduate Institute claims that
computational biology is a core element of the curriculum in its Master of
Bioscience degree.
Stanford
University M.S./PhD. in
BioMedical Informatics
University of
California, Irvine
Informatics in Biology and Medicine
David Delong wrote to me to point out that
the College of Natural and Agricultural
Sciences at the University of
California, Riverside is developing a "Center in
Genomics and Bioinformatics" which will offer a PhD curriculum
in genomics and bioinformatics from academic year 2001-2002 onwards.
Catherine Velazquez says that the University of California, Santa Cruz
will
start a new undergraduate BS course in
bioinformatics
in the fall of 2001. They also have made public their proposal for
an MS
in Bioinformatics.
Georgia
Georgia
Institute of Technology
Masters of Science in Bioinformatics
Maine
The Jackson
Lab, a world centre of mouse genome informatics, offers a graduate
training program.
Massachusetts
Boston
University and Northeastern University offer a graduate programme in
bioinformatics.
Mexico
At the National Autonomous University of
Mexico a
doctoral program in biomedical sciences is available. Their
Computational Molecular Biology Group is
here.
Minnesota
The
University of Minnesota offers a
graduate programme in bioinformatics.
New York State
Rochester
Institute of Technology
Bachelor's and Masters of Science in Bioinformatics
North Carolina
The North
Carolina State University
Genomic Sciences program offers Masters and PhDs in
Bioinformatics.
If you know of any other bioinformatics
courses on the American continent please feel free to
mail me about them.
Asia
India
According to Rahul Agrawal, the Indian Institute of Technology
Delhi, New Delhi provides courses in Biochemical
Engineering and Biotechnology. He adds that another branch of
the Institute, IIT
Kharagpur also provides various
courses in this area.
There is an
Advanced (Graduate) Diploma in Bioinformatics in the Bioinformatics Centre at the Jawaharlal Nehru University.
Madurai
Kamaraj University in Madurai, India claims to have been the
first in the country to initiate a bioinformatics programme and
advanced diploma in bioinformatics at its
School of Biotechnology
The University of Pune, Maharashtra, India offers
an Advanced Diploma in
Bioinformatics at its Bioinformatics Centre.
Singapore
The
Bioinformatics Centre of the
National University of Singapore offers Undergraduate and
PhD programmes in conjunction with the life sciences departments
and research institutions at NUS.
If you know of any other bioinformatics
courses in Asia please feel free to
mail me about them.
Australasia
Australia
As of 2001
Flinders University in Adelaide offers a
Bachelor of Science in Bioinformatics.
The
Research School of Biological Sciences, at the Australian National University in
Canberra offers PhD., MSc. and
Honours programs in Bioinformatics.
The
University of New South Wales in Sydney offers an
undergraduate program in
Bioinformatics.
The Biochemistry Department of La Trobe University in
Victoria also offers an
undergraduate course in
Bioinformatics.
If you know of any other bioinformatics
courses in Australasia please feel free to
mail me about them.
Europe
Denmark
The Technical
University of Denmark, Center
for Biological Sequence Analysis offers MSc.-level and
PhD.-level
courses in bioinformatics.
Finland
The Finnish
Graduate School in Computational Biology, Bioinformatics, and
Biometry or "ComBi"
is a joint venture of the University of Helsinki (English), the University of Turku (English) and the University of Tampere (English).
Germany
The
Technische Fakultät (Faculty
of Technology) at Universität Bielefeld (Bielefeld
University), offers a graduate programme in
Bioinformatik (bioinformatics).
The
Universität Tübingen (University
of Tübingen) also offers
Bioinformatik. Here are their own
Frequently Asked Questions (in German only) about studying
bioinformatics there.
Sweden
Bjorn Olsson writes that, as well as a
4-year Master's Degree in Bioinformatics, the
University of Skövde offers a number of short courses and allows
computer science master's students to include bioinformatics in
their degree. There is more information
here.
Apart from this, adds Daniel Nilsson, there
is only one other "pure" bioinformatics course in Sweden: the MSc in
Bioinformatics Engineering in Uppsala. There are also
opportunities to study bioinformatics on the "normal" biotech
courses in Gothenburg,
Linköping
and Umeå. In Gothenburg, the School of
Mathematical and Computing Sciences at Chalmers offers an MSc.
programme in bioinformatics---thanks to Samuel Hargestam.
United Kingdom
In the UK, there are only two dedicated
undergraduate courses in bioinformatics---one
at the University of Birmingham
and
another at UMIST. A
major problem is the desperate skills shortage in the area. Experts
in the field can earn considerably more in high-status commercial or
government research jobs than in universities---without having to
dedicate time to teaching. Bioinformatics is the ideal postgraduate
scientific subject, best suited to those who are already trained in
one of its constituent disciplines.
Two pioneering university institutions are Birkbeck College in the University
of London, a British centre with a proud tradition of educating
working and/or mature students to the highest academic standards and home to
a superb X-ray crystallography group, and York University, whose Department
of Biology offers
Masters courses and PhDs
in both computational biology and biomolecular science. Other
universities have bioinformatics groups actively involved in the
teaching of their biology/molecular biology undergraduate courses,
including, for example, courses at Leeds University where there are
also MRes
studentships available. Manchester University also teaches bioinformatics to
its undergraduates as well as offering a taught MSc. course
in the subject. University College London (UCL) also offers a final
year undergraduate course: "Bioinformatics: Genes,
Proteins and Computers".
Imperial
College recently displaced Oxford (at least
temporarily) from second place in various "charts" of the "best"
universities in the UK. [Disclaimer: I was a graduate student at
Imperial and will be lecturing there in 2001-2.] From next year the
Department of Biochemistry at Imperial is offering a new
MSc in Computational Genetics and Bioinformatics. (Oxford itself
hasn't yet deigned to recognize the field with a degree course.
[Disclaimer: I was an undergraduate there.])
Thank you to
David Parkinson for pointing out to me that for the past two
years Sheffield Hallam University has offered an
MSc/PGDip in Bioinformatics at its Graduate School
in Science, Engineering and Technology.
Other UK Bioinformatics courses include: University of Exeter MSc/MRes in
Bioinformatics.
University
of Liverpool M.Sc.,
Postgraduate Diploma and Postgraduate Certificate in Biosystems &
Informatics
University of Nottingham
Master of Philosophy in Molecular Biology with Bioinformatics
If you know of any other bioinformatics
courses in Europe please feel free to
mail me about them.
Where can I find
Bioinformatics jobs?
Start with the appointments / careers
sections of the major scientific journals, or, better, search
their Web jobs pages with "bioinformatics":
Appropriately for a Web-dependent
discipline, there are a variety of specialist commercial Web sites
which carry bioinformatics jobs:
There are also a number of companies
actively recruiting in the area:
Practical tips
This section includes some simple
rules-of-thumb to apply when performing common bioinformatics tasks.
I try to give a reference to a more detailed source of guidance
where I know of one.
How do I
find a sequence?
The most common task in bioinformatics must
be the acquisition of some bioinformatics data on which to operate.
Usually this is in the form of a nucleic acid or protein sequence,
stored as characters in the appropriate alphabet together with a
header of related information: for example, some kind of unique
identifying number, the species from which the original biological
substrate was obtained, the names of any authors who published the
sequence, and so on.
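As a rough illustration of such a record, here is a minimal Python sketch that splits a text file into (header, sequence) pairs. It assumes the widely used FASTA layout, in which each header line starts with ">" and the sequence follows on one or more lines; the FAQ text above does not name a format, so that choice is an assumption. The accession "X00001", the description and the sequence in the example are made up for illustration.

```python
def parse_fasta(text):
    """Split FASTA-style text into a list of (header, sequence) tuples.

    Assumes each record starts with a '>' header line and that the
    sequence may be wrapped over several lines.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the identifier and description are invented.
example = """>X00001 Homo sapiens hypothetical example
MKVAIL
AKIAIL
"""

for header, sequence in parse_fasta(example):
    identifier, description = header.split(None, 1)
    print(identifier, description, sequence)
```

Keeping the header as one string and splitting it only when needed avoids guessing at a header layout, which differs from database to database.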
You may have already generated your own
sequence data experimentally. In this case you are likely to want to
find sequences which are identical or similar (and therefore
possibly related) to yours. The task is then one of similarity
search.
...I have a description.
A paradoxical problem generated by the
success of the bioinformatics revolution is the increasing
difficulty of navigating the huge amount of data available. Once you
could print out most of the existing sequence databases onto paper
and cram them into a single binder. Now a search for "actin" alone
will pull out hundreds and hundreds of sequences. The key to finding
what you want is to develop your own discriminatory skills rather
than rely on computers to figure out what it is you're really
after.
Use PubMed
Make sure you are clear about your aim
first. If you are looking for a sequence for a specific scientific
purpose then you might do best to start with a relevant
human-generated publication. For example, if you have cloned a gene
which is part of a well-characterised biochemical pathway and you
want to find other sequences of the same functional gene product in
other species (orthologues), then PubMed is your
friend. [XXXX CONTINUE DETAILED ADVICE HERE]
Use Swiss Prot
[XXXX INSERT DETAILED ADVICE HERE]
Use Boolean logic
[XXXX INSERT DETAILED ADVICE HERE]
Use cunning
[XXXX INSERT DETAILED ADVICE HERE]
...I have an accession number.
[XXXX INSERT DETAILED SEQUENCE ADVICE HERE]
...I have another sequence.
This section will be expanded---and there
will be a more basic and detailed explanation for novice searchers,
but, in the meantime, here are the top tips cribbed from the
excellent
paper by Hugh B. Nicholas Jr., David W Deerfield II and
Alexander J. Ropelewski in BioTechniques.
- Use a local favourite program on the Web
server of your choice.
- Use at least two and preferably three
similarity tables.
- If using Smith-Waterman or FASTA
algorithms ensure that the gap opening penalty is high enough.
- If the initial search finds no or
insufficient matches repeat it with a highly diverged matrix
and/or with a Smith-Waterman-based server.
- If this doesn't work try switching from a
PAM matrix to a BLOSUM matrix.
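To make the gap opening penalty tip more concrete, here is a small Python sketch of Smith-Waterman local alignment scoring with affine gaps (the Gotoh recurrences). It is not taken from the paper cited above. For brevity it assumes a flat match/mismatch score instead of a PAM or BLOSUM table, and the parameter names and default values are illustrative only. Here gap_open is the cost of the first gapped position and gap_extend the cost of each further one.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Return the best local alignment score of a and b with affine gaps.

    Simplifying assumption: a flat match/mismatch score stands in for a
    real similarity table such as PAM or BLOSUM.
    """
    rows, cols = len(a) + 1, len(b) + 1
    neg = float("-inf")
    h = [[0] * cols for _ in range(rows)]    # best score ending at (i, j)
    e = [[neg] * cols for _ in range(rows)]  # ending with a gap in a
    f = [[neg] * cols for _ in range(rows)]  # ending with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            e[i][j] = max(e[i][j - 1] - gap_extend, h[i][j - 1] - gap_open)
            f[i][j] = max(f[i - 1][j] - gap_extend, h[i - 1][j] - gap_open)
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The zero lets a local alignment start afresh anywhere.
            h[i][j] = max(0, diag, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


# A cheap gap lets the alignment bridge the extra G; a costly gap does not.
print(smith_waterman_score("AAAACCCC", "AAAAGCCCC", gap_open=2))   # 14
print(smith_waterman_score("AAAACCCC", "AAAAGCCCC", gap_open=10))  # 13
```

With the low penalty the best alignment opens a gap opposite the extra G. With the high penalty it falls back to an ungapped alignment containing one mismatch. This is the kind of change in behaviour the tip about gap opening penalties refers to.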
...I'm not sure whether or not to use the defaults.
Hugh, David and Alexander again on when not
to use the default search parameters provided by a server.
- ...when the homologues you are looking
for are highly diverged from your query.
- ...when the query or matches are short.
- ...when you are only interested in a
specific (in the sense of "species") subset of database matches
with a particular evolutionary relationship to your sequence of
interest---a relationship not implied by the default settings.
How can
I align two sequences?
This section will also be expanded for
newbies; until then, here are Hugh, David and Alexander's tips for
alignment:
- Use an appropriately divergent matrix
(I'll be adding a table soon to explain this).
- Reduce your gap penalty relative to that
you used for your database search.
- Use the MaxSegs/Waterman-Eggert version
of the dynamic programming algorithm to provide the best local
alignment and also to search for repeats.
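For readers who want to see what a pairwise alignment program does underneath, here is a short Python sketch. Note that it swaps in the simpler Needleman-Wunsch global algorithm with a single linear gap penalty; it is not the MaxSegs/Waterman-Eggert local method recommended above. The flat match/mismatch scores are an assumption standing in for a real scoring matrix, and the function and parameter names are illustrative.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (gapped_a, gapped_b, score). Ties in the traceback are
    broken in the order: diagonal, gap in b, gap in a.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")   # gap in b
            i -= 1
        else:
            top.append("-")      # gap in a
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[len(a)][len(b)]


gapped_a, gapped_b, total = global_align("GSK", "GSGK")
print(gapped_a)  # GS-K
print(gapped_b)  # GSGK
print(total)     # 2
```

Changing the gap argument is the simplest way to see how the gap penalty shapes the result: a harsher penalty favours mismatches over gaps, a gentler one the reverse.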
How can I predict the function of a gene (product)?
[XXXX INSERT FUNCTION PREDICTION ADVICE
HERE]
How can I predict the structure of a sequence?
[XXXX INSERT STRUCTURE PREDICTION ADVICE
HERE]
How can I
write up?
Go
here to download some detailed advice. Go here for
more links.
Glossary
of bioinformatics terms
Here I attempt to define some common terms
in bioinformatics. I have tried to balance clarity, brevity and
rigour. Let me know if I let one of these priorities over-ride the
others.
What is an alignment?
When two symbolic representations of DNA or
protein sequences are arranged next to one another so that their
most similar elements are juxtaposed they are said to be
aligned. Many bioinformatics tasks depend upon successful
alignments. Alignments are conventionally shown as traces.
In a symbolic sequence each base or residue
monomer in each sequence is represented by a letter. The convention
is to print the single-letter codes for the constituent monomers in
order in a fixed font (from the N-most to C-most end of the protein
sequence in question or from 5' to 3' of a nucleic acid molecule).
This is based on the assumption that the combined monomers are evenly
spaced along the single dimension of the molecule's primary
structure. From now on I shall refer to an alignment of two protein
sequences.
Every element in a trace is either a
match or a
gap. Where a residue in one of two aligned
sequences is identical to its counterpart in the other the
corresponding amino-acid letter codes in the two sequences are
vertically aligned in the trace: a match. When a residue in one
sequence seems to have been deleted since the assumed divergence of
the sequence from its counterpart, its "absence" is labelled by a
dash in the derived sequence. When a residue appears to have been
inserted to produce a longer sequence a dash appears opposite in the
unaugmented sequence. Since these dashes represent "gaps" in one
or other sequence, the action of inserting such spacers is known as
gapping.
A deletion in one sequence is symmetric with
an insertion in the other. When one sequence is gapped relative to
another a deletion in sequence a can be seen as an insertion
in sequence b. Indeed, the two types of mutation are referred
to together as
indels. If we imagine that at some point one of the
sequences was identical to its primitive homologue, then a trace can
represent the three ways divergence could occur (at that point).
Biological interpretation of an alignment
A trace can represent a substitution:
AKVAIL
AKIAIL
A trace can represent a deletion:
VCGMD
VC-GD
A trace can represent an insertion:
GS-K
GSGK
For obvious reasons I do not represent a
silent mutation.
Traces may represent recent genetic changes
which obscure older changes. Here I have only represented point
mutations for simplicity. Actual mutations often insert or delete
several residues.
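The interpretation above can be sketched in a few lines of Python that label each column of a two-sequence trace. The labels and the function name are my own illustrative choices. As in the examples, the upper sequence is treated as the reference, so a dash in the lower sequence is reported as a deletion and a dash in the upper sequence as an insertion. Substitutions are labelled separately from exact matches.

```python
def classify_trace(top, bottom):
    """Label each column of a pairwise trace.

    A dash in `bottom` is reported as a deletion and a dash in `top`
    as an insertion, both relative to `top`.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    labels = []
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be gapped in both sequences")
        if y == "-":
            labels.append("deletion")
        elif x == "-":
            labels.append("insertion")
        elif x == y:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


print(classify_trace("AKVAIL", "AKIAIL"))
# ['match', 'match', 'substitution', 'match', 'match', 'match']
print(classify_trace("GS-K", "GSGK"))
# ['match', 'match', 'insertion', 'match']
```

Swapping the two arguments turns every insertion into a deletion and vice versa, which is the symmetry behind the term "indel".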
What is a DNA array?
[INSERT FULL DEFINITION HERE.]
What is a homologue?
[INSERT FULL DEFINITION HERE.]
What is a scoring matrix?
[INSERT FULL DEFINITION HERE.]
Acknowledgements
Questions
Thanks to the following people for
questions:
- Jonathan Després
- Michael Wentzel
Links
Thanks to the following people for links and
sources:
- Anuradha Acharya
- Rahul Agrawal
- Jeff Ames
- Nigel Barber
- David Delong
- Samuel Hargestam
- Darren Lee
- Wentian Li
- Steve Masticola
- Daniel Nilsson
- Bjorn Olsson
- David Parkinson
- G. Deepak Reddy
- John Rowland
- Cathal Seoighe
- Jennifer Steinbachs
- James Thompson
- Humberto Ortiz Zuazaga
- Catherine Velazquez
- Zuthur Yew
- Michael Zuker
Answers
Thanks to the following people for
suggesting answers:
- Paul Boardman
- Sangeeta Sawant
- Fredj Tekaia
|