A lot of what we do when writing programs for biology can be described as searching for patterns in strings. The obvious examples come from the analysis of biological sequence data: remember that DNA, RNA and protein sequences are just strings. Many of the things we want to look for in biological sequences can be described in terms of patterns:
• protein domains
• DNA transcription factor binding motifs
• restriction enzyme cut sites
• degenerate PCR primer sites
• runs of mononucleotides
However, it's not just sequence data that can have interesting patterns. As we discussed in chapter 3, most of the other types of data we have to deal with in biology come in the form of strings[1] inside text files, things like:
• read mapping locations
• geographical sample coordinates
• taxonomic names
• gene names
• gene accession numbers
• BLAST searches
[1] Note that although many of the things in this list are numerical data, they're still read in to Python programs as strings and need to be manipulated as such.
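As a minimal sketch of that last point, here is how a numerical value that arrives as text has to be converted before we can do arithmetic with it. The sample line, its tab-separated layout and the variable names are made up for illustration.

```python
# A hypothetical line from a text file of sample coordinates (tab-separated)
line = "sample_42\t51.5074\t-0.1278"

# Splitting the line gives us three strings, even for the numerical fields
name, lat_text, lon_text = line.split("\t")

# We have to convert the text explicitly before treating it as a number
latitude = float(lat_text)
longitude = float(lon_text)

print(type(lat_text).__name__, type(latitude).__name__)
```

Until the conversion happens, the coordinates are ordinary strings, so any pattern-matching tool that works on strings can be applied to them.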
Regular expressions
In previous chapters, we've looked at some programming tasks that involve pattern recognition in strings. We've seen how to count individual amino acid residues (and even groups of amino acid residues) in protein sequences, and how to identify restriction enzyme cut sites in DNA sequences (chapter 2). We've also seen how to examine parts of gene names and match them against individual characters.
The common theme among all these problems is that they involve searching for a fixed set of characters. But there are many problems that we want to solve that require more flexible patterns. For example:
• Given a DNA sequence, what's the length of the poly-A tail?
• Given a gene accession name, extract the part between the third character and the underscore
• Given a protein sequence, determine if it contains this highly-redundant domain motif
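To make the difference between fixed and flexible patterns concrete, here is a rough sketch of the first problem in the list. It is only a preview: it uses the re module and pattern syntax that are explained later in this chapter, and the DNA sequence and the function name poly_a_tail_length are invented for the example.

```python
import re

def poly_a_tail_length(dna):
    """Return the number of consecutive A characters at the end of dna."""
    # A+ means "one or more A", and $ anchors the match to the end of the string
    match = re.search(r"A+$", dna)
    if match is None:
        return 0
    return len(match.group())

dna = "ATGCGTACGTTAGCAAAAAAAA"  # made-up sequence

# A fixed-pattern tool counts every A, wherever it occurs
print(dna.count("A"))

# The flexible pattern measures only the run of A at the end
print(poly_a_tail_length(dna))
```

Counting with a fixed pattern cannot tell the tail apart from the A bases scattered through the rest of the sequence, which is why a more flexible way of describing the pattern is needed.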
Because these types of problems crop up in so many different fields, there's a standard set of tools in Python[1] for dealing with them: regular expressions. Regular expressions[2] are a topic that might not be covered in a general-purpose programming book, but because they're so useful in biology, we're going to devote the whole of this chapter to looking at them.
Although the tools for dealing with regular expressions are built in to Python, they are not made automatically available when you write a program. In order to use them we must first talk about modules.
[1] And in many other languages and utilities.
[2] The name is often abbreviated to regex.
Modules in Python
The functions and data types that we've discussed so far in this book have been ones that are likely to be needed in pretty much every program: tools for dealing with strings and numbers, for reading and writing files, and for manipulating lists of data. As such, they are automatically made available when we start to create a Python program. If we want to open a file, we simply write a statement that uses the open function.
However, there's another category of tools in Python which are more specialized. Regular expressions are one example, but there is a large list of specialized tools which are very useful when you need them[1], but are not likely to be needed for the majority of programs. Examples include tools for doing advanced mathematical calculations, for downloading data from the web, for running external programs, and for manipulating date/time information. Each collection of specialized tools (really just a collection of specialized functions and data types) is called a module. For reasons of efficiency, Python doesn't automatically make these modules available in each new program, as it does with the more basic tools. Instead, we have to explicitly load each module of specialized tools that we want to use inside our program. To load a module we use the import statement[2]. For example, the module that deals with regular expressions is called re, so if we want to write a program that uses the regular expression tools we must include the line:
import re
at the top of our program. When we then want to use one of the tools from a module, we have to prefix it with the module name. For example, to use the regular expression search function (which we'll discuss later in this chapter) we have to write:
re.search(pattern, string)
rather than simply:
search(pattern, string)
If we forget to import the module which we want to use, or forget to include the module name as part of the function call, we will get a NameError.
[1] Indeed, this is one of the great strengths of the Python language.
[2] This is the reason for the from __future__ import division statement that we have to include if we're using Python 2.
We'll encounter various other modules in the rest of this book. For the rest of this chapter specifically, all code examples will require the import re statement in order to work. For clarity, we won't include it, so if you want to try running any of the code in this chapter, you'll need to add it at the top.
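The following short sketch puts the import rule and the NameError side by side. The DNA sequence and the variable names are invented for the example, and the pattern is just a fixed string, so no special regular expression syntax is involved yet.

```python
import re

dna = "ATGGAATTCTAG"  # made-up sequence containing GAATTC

# With the module name as a prefix, Python can find the search function
found = re.search(r"GAATTC", dna) is not None

# Without the prefix, the name search is not defined anywhere in our program
try:
    search(r"GAATTC", dna)
    error_name = None
except NameError as err:
    error_name = type(err).__name__

print(found, error_name)
```

Leaving out the import re line altogether has the same effect, except that the NameError then refers to re rather than to search.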
Raw strings
Writing regular expression patterns, as we'll see in the very next section of this chapter, requires us to type a lot of special characters. Recall from chapter 2 that certain combinations of characters are interpreted by Python to have special meaning. For example, \n means start a new line, and \t means insert a tab character.
Unfortunately, there are a limited number of special characters to go round, so some of the characters that have a special meaning in regular expressions clash with the characters that already have a special meaning. Python's way round this problem is to have a special rule for strings: if we put the letter r immediately before the opening quotation mark, then any special characters inside the string are ignored:
print(r"\t\n")
The r stands for raw, which is Python's description for a string where special characters are ignored. Notice that the r goes outside the quotation marks: it is not part of the string itself. We can see from the output that the above code prints out the string just as we've written it:
\t\n
without any tabs or new lines. You'll see this special raw notation used in all the regular expression code examples in this chapter.
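One way to convince ourselves of the difference is to compare the lengths of a normal string and a raw string written with the same characters. The sketch below also shows why this matters for regular expressions; the accession string is made up, and the \d pattern (which stands for any digit) is an assumption borrowed from standard regular expression syntax rather than something covered so far.

```python
import re

normal = "\t\n"   # two characters: a tab and a new line
raw = r"\t\n"     # four characters: backslash, t, backslash, n

print(len(normal), len(raw))

# Because the string is raw, the backslash reaches the re module untouched,
# where \d+ is read as "one or more digits"
accession = "AB1234"  # hypothetical accession
digits = re.search(r"\d+", accession).group()
print(digits)
```

In the raw version Python leaves the backslashes alone, so it is the regular expression tools, not Python's string rules, that decide what they mean.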