Writing search patterns

This section explains how to create search patterns using BBEdit's grep syntax. (For readers with prior experience, this is essentially like the syntax used for regular expressions by the Unix egrep utility, with some borrowing from Perl.)

Most characters match themselves
Most characters that you type into the Find dialog box match themselves. For instance, if you're looking for the letter "t", Grep stops and reports a match when it encounters a "t" in the text. This idea is so obvious that it seems not worth mentioning, but the important thing to remember is that these characters are search patterns. Very simple patterns, to be sure, but patterns nonetheless.

Escaping special characters
In addition to the simple character matching discussed above, there are various special characters that have different meanings when used in a grep pattern than in a normal search. (The use of these characters is covered in the following sections.)

However, sometimes you will need to include an exact, or literal, instance of these characters in your grep pattern. In this case, you must use the backslash character \ before that special character to have it be treated literally; this is known as "escaping" the special character. To search for a backslash character itself, double it \\ so that its first appearance will escape the second.

For example, you might need to re-arrange a list of labels which contains numbered items written like this: #1, #2, #3. As described in the next section, if you used the character # in your grep pattern, it would match any of the digits following the number sign, but not the number sign itself. So, you would need to write \# as part of your search pattern, to indicate that you want to match a literal number sign character rather than using its special meaning.
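The escaping rule can be tried outside BBEdit as well. The following is a minimal sketch using Python's re module, which is a different grep engine from BBEdit's: it does not treat # as a digit wildcard, so the sketch escapes + and \ instead. The sample text is invented for illustration.

```python
import re

# Raw string: the text contains one literal backslash before "temp".
text = r"Price: 4+2 (see C:\temp)"

# Unescaped, "+" is a repetition operator; escaped, it matches a literal plus sign.
literal_plus = re.findall(r"\+", text)

# A doubled backslash in the pattern matches one literal backslash in the text.
literal_backslash = re.findall(r"\\", text)

# re.escape() inserts the backslashes for you when a whole string should be literal.
escaped = re.escape("4+2")
whole_literal = re.findall(escaped, text)

print(literal_plus, literal_backslash, whole_literal)
```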

Note - When passing grep patterns to BBEdit via AppleScript, be aware that both the backslash and double-quote characters have special meaning to AppleScript. In order to pass these through correctly, you must escape them in your script. Thus, to pass \r for a carriage return to BBEdit, you must write \\r in your AppleScript string.

Wildcards match types of characters
These special characters, or metacharacters, are used to match certain types of other characters:

Wildcard		Matches...

.		any character except a line break (i.e. a carriage return)

#		any numeric character 0-9

^		beginning of a line (unless used in a character class)

$		end of line (unless used in a character class)

Being able to specifically match text starting at the beginning or end of a line is an especially handy feature of grep. For example, if you wanted to find every instance of a message sent by Patrick, from a log file which contains various other information like so:

From: Rich, server: barebones.com
To: BBEdit-Talk, server: lists.barebones.com
From: Patrick, server: example.barebones.com

you could search for the pattern:

^From: Patrick

and you will find every occurrence of these lines in your file (or set of files if you do a multi-file search instead).
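If you want to experiment with line anchors outside BBEdit, here is a rough equivalent in Python's re module. This is a sketch under the assumption that the log uses line feeds; note that Python needs the re.MULTILINE flag before ^ and $ will match at every line rather than only at the start and end of the whole text.

```python
import re

log = (
    "From: Rich, server: barebones.com\n"
    "To: BBEdit-Talk, server: lists.barebones.com\n"
    "From: Patrick, server: example.barebones.com\n"
)

# re.MULTILINE lets ^ match at the start of each line and $ at the end of each line.
# The added .*$ extends each match to the end of its line.
matches = [
    m.group(0)
    for m in re.finditer(r"^From: Patrick.*$", log, flags=re.MULTILINE)
]

print(matches)
```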

Character classes match sets or ranges of characters
The character class construct lets you specify a set or a range of characters to match, or to ignore. A character class is constructed by placing a pair of square brackets [...] around the group or range of characters you wish to include. To exclude, or ignore, all characters specified by a character class, add a caret character ^ just after the opening bracket [^...]. For example:

Character class		Matches...

[xyz]		any one of the characters x, y, z

[^xyz]		any character except x, y, z

[a-z]		any character in the range a to z

You can use any number of characters or ranges between the brackets. Here are some examples:

Character class		matches

[aeiou]		any vowel

[^aeiou]		any character that is not a vowel

[a-zA-Z0-9]		any alphanumeric character

[^aeiou0-9]		any character that is neither a vowel nor a digit

Note - The Case Sensitive option in the Find dialog does affect set and range matches. (Previous versions of the BBEdit manual have incorrectly stated that it does not.) If Case Sensitive is enabled, then all matches are case sensitive, so for example, a character range [A-Z] will match only the capital letters A through Z. If Case Sensitive is turned off, the same character range will match both capital A through Z, and lowercase a through z.

A set pattern matches when the search encounters any one of the characters in the pattern. However, the contents of a set are only treated as separate characters, not as words. For example, if your search pattern is [beans] and the text in the window is "lima beans", BBEdit will report a match at the "a" of the word "lima".
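The "lima beans" behavior is easy to confirm. This sketch uses Python's re module rather than BBEdit itself, but sets are treated as separate characters in the same way in both.

```python
import re

text = "lima beans"

# The set [beans] means "any one of b, e, a, n, s", not the word "beans".
set_match = re.search(r"[beans]", text)
set_position = set_match.start()   # zero-based offset of the first match
set_char = set_match.group(0)

# Without the brackets, the pattern matches the whole word.
word_position = re.search(r"beans", text).start()

print(set_position, set_char, word_position)
```

The set matches the "a" that ends "lima" (offset 3), while the plain word pattern does not match until offset 5.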

To include the character ] in a set or a range, place it immediately after the opening bracket. To use the ^ character, place it anywhere except immediately after the opening bracket. To match a dash character (hyphen) in a range, place it at the beginning of the range; to match it as part of a set, place it at the beginning or end of the set.

Character class		matches

[]0-9]		any digit or ]

[aeiou^]		a vowel or ^

[-A-Z]		a dash or A - Z

[--A]		any character in the range from - to A

[aeiou-]		a vowel or -
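The placement rules for ], ^, and - can be checked with a short script. This is a sketch in Python's re module, which happens to follow the same placement conventions for these three characters; the sample string is made up for the purpose.

```python
import re

sample = "a]b^c-d5"

# ] placed immediately after the opening bracket is a literal ].
bracket_or_digit = re.findall(r"[]0-9]", sample)

# ^ anywhere except first is a literal caret.
vowel_or_caret = re.findall(r"[aeiou^]", sample)

# - at the end of the set is a literal dash.
vowel_or_dash = re.findall(r"[aeiou-]", sample)

print(bracket_or_digit, vowel_or_caret, vowel_or_dash)
```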

Matching non-printing characters
As described in the previous section on searching, BBEdit provides several special character pairs that you can use to match certain non-printing characters. You can use these special characters in grep patterns as well as for normal searching.

(New in 6) You can now also specify any character in a grep pattern by means of its hexadecimal character code (escape code).

For example, to look for a tab or a space, you would use the set pattern [\t ] (consisting of a tab special character and a space character).

Character		Matches...

\r		line break (carriage return)

\n		Unix line break (line feed)

\t		tab

\f		page break (form feed)

\xNN		hexadecimal character code NN (e.g. \x0D for CR)

\\		backslash

Use \r to match a line break in the middle of a pattern and the special characters ^ and $ (described above) to "anchor" a pattern to the beginning of a line or to the end of a line. In the case of ^ and $, the line break character is not included in the match.
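As an illustration, the sketch below uses Python's re module, which accepts the same \t, \r, and \xNN escapes (treat the equivalence with BBEdit as an assumption for anything beyond these). The sample text is invented.

```python
import re

# One tab, one carriage return + line feed pair, and one space.
text = "name\tvalue\r\nnext line"

# The set [\t ] matches either a tab or a space.
tabs_or_spaces = re.findall(r"[\t ]", text)

# \x0D is the hexadecimal code for a carriage return, so it finds the same
# character as \r.
cr_by_hex = re.findall(r"\x0D", text)
cr_by_escape = re.findall(r"\r", text)

print(tabs_or_spaces, cr_by_hex, cr_by_escape)
```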

Other special characters
BBEdit also supports several Perl-like expressions for matching different types or categories of characters.

Special Character		Matches...

\s		any whitespace character (space, tab, carriage return, line feed, form feed)

\S		any non-whitespace character (any character not included by \s)

\w		any word character (a-z, A-Z, 0-9, _, and some 8-bit characters)

\W		any non-word character (all characters not included by \w, including carriage returns)

\d		any digit (0-9), same as #

\D		any non-digit character (incl. carriage return)

A "word" is defined in BBEdit as any run of non-word-break characters bounded by word breaks. Word characters are generally alphanumeric, and some characters whose value is greater than 127 are also considered word characters.

Note that any character matched by \s is by definition not a word character; thus, anything matched by \s will also be matched by \W (but not the reverse!).
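The relationship between \s and \W can be demonstrated with a handful of test characters. This sketch uses Python's re module; its definition of a word character may differ from BBEdit's for characters above 127, so only plain ASCII is tested here.

```python
import re

chars = [" ", "\t", "\r", "\n", "\f", "a", "Z", "7", "_", "!", "-"]

# Characters matched by \s (whitespace) and by \W (non-word).
whitespace = {c for c in chars if re.fullmatch(r"\s", c)}
non_word = {c for c in chars if re.fullmatch(r"\W", c)}

# Every whitespace character is also a non-word character...
print(whitespace <= non_word)

# ...but some non-word characters are not whitespace.
print(sorted(non_word - whitespace))
```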

Repetition metacharacters repeat other patterns
The special characters *, +, and ? specify how many times the pattern preceding them may repeat. The preceding pattern can be a literal character, a wildcard character, a character class, or a special character.

Pattern		Matches...

p*		zero or more p's

p+		one or more p's

p?		zero or one p's

Note that the repetition characters * and ? can match zero occurrences of the pattern (* matches zero or more, ? matches zero or one). That means that they will always succeed, because there will always be at least zero occurrences of any pattern, but that they will not necessarily select any text (if no occurrences of the preceding pattern are present).

Try out the following examples to see how their descriptions follow the behavior you observe:

Pattern	Text is...	Matches...

.*	Fourscore and seven years	Fourscore and seven years

[0-9]+	I've been a loyal member since 1983 or so.	1983

#+	I've got 12 years on him.	12

A*	BAAAAAAAB	advances the insertion point past the first and last "B"s, and matches "AAAAAAA"

A+	BAAAAAAAB	AAAAAAA

A?	Andy joined AAA	the "A" from Andy

A+	Ted joined AAA yesterday	"AAA" and the "a" from yesterday (with Case Sensitive turned off)
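You can reproduce several of these rows with a short script. The sketch below uses Python's re module, with two assumptions to keep in mind: Python has no # wildcard, so [0-9] stands in for it, and Python reports a zero-length match for A* at the very start of the text, where BBEdit advances the insertion point instead.

```python
import re

# .* takes the whole line.
whole_line = re.search(r".*", "Fourscore and seven years").group(0)

# One or more digits.
year = re.search(r"[0-9]+", "I've been a loyal member since 1983 or so.").group(0)

# A+ needs at least one A, so it skips the leading B.
run_of_a = re.search(r"A+", "BAAAAAAAB").group(0)

# A* succeeds immediately with zero A's: the match is empty.
empty = re.search(r"A*", "BAAAAAAAB")

# A? takes at most one A.
single_a = re.search(r"A?", "Andy joined AAA").group(0)

# With case ignored, A+ also finds the lowercase "a" in "yesterday".
case_blind = re.findall(r"A+", "Ted joined AAA yesterday", flags=re.IGNORECASE)

print(whole_line, year, run_of_a, repr(empty.group(0)), single_a, case_blind)
```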

Combining patterns to make complex patterns
So far, the patterns you have seen match a single character or the repetition of a single character or class of characters. This is very useful when you are looking for runs of digits or single letters, but often that's not enough.

However, by combining these patterns, you can search for more complex items. You're already familiar with combining patterns. Remember the section at the beginning of this discussion that said that each individual character is a pattern that matches itself? When you search for a word, you are already combining basic patterns.

You can combine any of the preceding grep patterns in the same way. Here are some examples.

Pattern	Matches	Examples

#+\+#+	a string of digits, followed by a literal plus sign, followed by more digits	4+2 1234+5829

####[\t ]B\.C\.	four digits, followed by a tab or a space, followed by the string B.C.	2152 B.C.

\$?[0-9,]+\.#*	an optional dollar sign, followed by one or more digits and commas, followed by a period, then zero or more digits	1,234.56 $4,296,459.19 $3,5,6,4.0000 0. (oops!)

Note again in these examples how the characters that have special meaning to grep are preceded by a backslash: \+, \., and \$.

Also, as you can see from the last example, with a pattern which was intended to find dollar amounts, it's not always possible to keep a pattern from matching unexpectedly.
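To see the unexpected match for yourself, you can run the combined patterns through a script. This is a sketch in Python's re module, where \d replaces BBEdit's # wildcard (an adaptation, not BBEdit syntax).

```python
import re

# Digits, a literal plus sign, more digits.
sums = re.findall(r"\d+\+\d+", "4+2 1234+5829")

# Optional dollar sign, digits and commas, a period, then zero or more digits.
amount_pattern = r"\$?[0-9,]+\.\d*"
amounts = re.findall(amount_pattern, "1,234.56 $4,296,459.19 $3,5,6,4.0000 0.")

print(sums)
print(amounts)
```

The last two items in the second list are not well-formed dollar amounts, yet the pattern accepts them, because nothing in it requires commas to fall every three digits or digits to follow the period.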

Creating subpatterns
Subpatterns provide a means of organizing or grouping complex grep patterns. This is primarily important for two reasons: for limiting the scope of the alternation operator (which otherwise creates an alternation of everything to its left and right), and for changing the matched text when performing replacements.

A subpattern consists of any simple or complex pattern, enclosed in a pair of parentheses:

Pattern		Matches...

(p)		the pattern p and remembers it

You can combine more than one subpattern into a grep pattern, or mix subpatterns and other pattern elements as you need.

Taking the last set of examples, you could modify these to use subpatterns wherever actual data appears:

Pattern	Matches	Examples

(#+)\+(#+)	a string of digits, followed by a plus sign, followed by more digits	4+2 1234+5829

(####)[\t ]B\.C\.	four digits, followed by a tab or a space, followed by the string B.C.	2152 B.C.

\$?([0-9,]+)\.(#*)	an optional dollar sign, followed by one or more digits and commas, followed by a period, then zero or more digits	1,234.56 $4,296,459.19 $3,5,6,4.0000 0.

We will revisit subpatterns in the section on replacement, so that you can see how the choice of subpatterns affects the changes you can make.
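To get a feel for what "remembers it" means, the sketch below prints the text captured by each subpattern. It uses Python's re module, with \d in place of the # wildcard; how BBEdit exposes the remembered text in replacements is covered in the section on replacement and is not shown here.

```python
import re

# Two subpatterns: the digits on either side of the plus sign.
sum_match = re.search(r"(\d+)\+(\d+)", "1234+5829")
sum_parts = sum_match.groups()

# Two subpatterns: the whole-number part and the decimals.
# The optional dollar sign is outside the parentheses, so it is not remembered.
amount_match = re.search(r"\$?([0-9,]+)\.(\d*)", "$4,296,459.19")
amount_parts = amount_match.groups()

print(sum_parts)
print(amount_parts)
```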

Warning: Be careful when using subpatterns of the form (p)*, as it is possible to create an overflow condition for BBEdit's grep engine. BBEdit now attempts to guard against this by stopping the search process if it detects a potential overflow; instead of crashing, the pattern should just fail to match. If you need to use this type of expression, we recommend using (p)+ instead. Or in particular, if you are trying to work around . not matching carriage returns, use [\s\S]* instead of (.|\r)*.

Using alternation
The alternation operator | allows you to match any of several patterns at a given point. To use this operator, place it between two or more patterns, as in x|y, to match either x or y.

As with all of the preceding options, you can combine alternation with other pattern elements to handle more complex searches.

Pattern	Text is...	Matches...

a|t	A cat	each "a" and "t"

a|c|t	A cat	each "a", "c", and "t"

a (cat|dog) is	A cat is here. A dog is here. A giraffe is here.	"A cat is", "A dog is"

A|b+	Abba	"A", "bb", and "a"

Andy|Ted	Andy and Ted joined AAA yesterday	"Andy" and "Ted"

####|years	I've been a loyal member since 1983, almost 16 years ago.	"1983", "years"

[a-z]+|#+	That's almost 16 years.	"That", "s", "almost", "16", "years"
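The alternation examples can be checked with a small script. This sketch uses Python's re module with case ignored (the rough equivalent of turning Case Sensitive off) and \d in place of the # wildcard; all_matches is a hypothetical helper written for this illustration.

```python
import re

def all_matches(pattern, text):
    """Return the full text of every match, ignoring case."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

letters = all_matches(r"a|t", "A cat")

# The parentheses limit the alternation to "cat" or "dog".
pets = all_matches(
    r"a (cat|dog) is",
    "A cat is here. A dog is here. A giraffe is here.",
)

# Without parentheses, the + applies only to b, not to the whole alternation.
abba = all_matches(r"A|b+", "Abba")

words_and_numbers = all_matches(r"[a-z]+|\d+", "That's almost 16 years.")

print(letters, pets, abba, words_and_numbers)
```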

The "longest match" issue
When creating complex patterns, you should bear in mind that the repetition characters * and + are "greedy". That is, they will always make the longest possible match to a given pattern, so if your pattern is E* (zero or more E's) and your text contains "EEEE", the pattern matches all the E's at once, not just the first one. This is usually what you want, but not always.

Suppose, for instance, that you want to match an HTML tag. You'd think that a good way to do this would be to search for a pattern

<.+>

consisting of a less-than sign, followed by one or more occurrences of a single character, followed by a greater-than sign. To understand why this may not work the way you think it should, consider the following sample text to be searched:

<B>This text is in boldface.</B>

The intent was to write a pattern that would match both of the HTML tags separately. Let's see what actually happens. The < character at the beginning of this line matches the beginning of the pattern. The next character in the pattern is . which matches any character (except a line break), modified with the + repetition operator. Taken together, this combination means one or more repetitions of any character. That, of course, takes care of the B. The problem is that the next > is also "any character" and that it also qualifies as "one or more repetitions." In fact, all of the text up to the end of the line qualifies as "one or more repetitions of any character" (the line break doesn't qualify, so grep stops there). After grep has reached the line break, it has exhausted the + operator, so it backs up and sees if it can find a match for >. Lo and behold, it can: the last character is a greater-than symbol. Success!

In other words, the pattern matches our entire sample line at once, not the two separate HTML tags in it as we intended. More generally, the pattern matches all the text in a given line or paragraph from the first < to the last >. The pattern only does what we intended when there is only one HTML tag in a line or paragraph. This is what we mean when we say that * and + try to make the longest possible match.

To work around this behavior, you must modify your pattern to take advantage of additional context present in the data, if possible. Instead of matching multiple occurrences of any character, try to match any character other than the character that marks the end of the text you want to select. You can do this with the character range operator [...] by including a ^ as the first character of the range to mean any character besides the indicated characters. For example:

<[^>]+>

...matches an opening angle bracket, then one or more occurrences of any character other than a closing angle bracket, followed by a closing angle bracket. This achieves the results you want, preventing BBEdit from "overrunning" the closing angle bracket.
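The difference between the two patterns shows up clearly when both are run against a line containing an opening and a closing tag. This is a sketch in Python's re module, whose * and + are greedy in the same way; the sample line is an assumption chosen to contain two tags.

```python
import re

line = "<B>This text is in boldface.</B>"

# Greedy: .+ runs to the end of the line, then backs up to the last >.
greedy = re.findall(r"<.+>", line)

# Negated class: [^>]+ cannot run past the first >, so each tag matches separately.
bounded = re.findall(r"<[^>]+>", line)

print(greedy)
print(bounded)
```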

A slightly more complicated example: how could you write a pattern that matches all text between <B> and </B> HTML tags? Consider the sample text below:

<B>Welcome</B> to the home of <B>BBEdit!</B>

As before, you might be tempted to write:

<B>.*</B>

...but for the same reasons as before, this will match the entire line of text. The solution is similar. We want the * character to stop repeating when it reaches the opening < of the </B> tag, so we tell it to match any character that's not a <.

<B>[^<]*</B>
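Here is the same comparison as a runnable sketch in Python's re module. It assumes the tags in question are <B> and </B> and that the sample line contains two tagged runs of text; both the tag names and the sample line are assumptions made for the illustration.

```python
import re

line = "<B>Welcome</B> to the home of <B>BBEdit!</B>"

# Greedy: .* runs from the first <B> all the way to the last </B>.
too_much = re.findall(r"<B>.*</B>", line)

# [^<]* stops at the first <, so each tagged run is matched on its own.
each_run = re.findall(r"<B>[^<]*</B>", line)

print(too_much)
print(each_run)
```

One limitation worth knowing: because [^<]* refuses to cross any < character, this pattern will not match a tagged run that itself contains another tag.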