## Friday, March 11, 2011

### Regex, Pattern and Matcher

Notes on the Regex specifics I need to know.

#### Iteration

`?` The ? (question mark) matches the preceding character 0 or 1 times only. For example, colou?r will find both color (0 times) and colour (1 time).
`*` The * (asterisk or star) matches the preceding character 0 or more times. For example, tre* will find tree (2 times), tread (1 time) and trough (0 times).
`+` The + (plus) matches the preceding character 1 or more times. For example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).
`{n}` Matches the preceding character, or character range, exactly n times. For example, to find a local phone number we could use [0-9]{3}-[0-9]{4}, which would find any number of the form 123-4567.
Note: The - (dash) in this case, because it is outside the square brackets, is a literal. The value is enclosed in braces (curly brackets).
`{n,m}` Matches the preceding character at least n times but not more than m times. For example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. The values are enclosed in braces (curly brackets).
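
To check the quantifier examples above, here is a minimal sketch in Java. The class name `QuantifierDemo` and the helper `matchesFully` are made up for illustration; the only library call is `Pattern.matches`, which requires the whole input to match the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class QuantifierDemo {

    // True only when the WHOLE input matches the regex.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ? : the preceding character 0 or 1 times
        System.out.println(matchesFully("colou?r", "color"));    // true
        System.out.println(matchesFully("colou?r", "colour"));   // true
        System.out.println(matchesFully("colou?r", "colouur"));  // false

        // * : 0 or more times, + : 1 or more times
        System.out.println(matchesFully("tre*", "tr"));          // true
        System.out.println(matchesFully("tre+", "tr"));          // false
        System.out.println(matchesFully("tre+", "tree"));        // true

        // {n} : exactly n times
        System.out.println(matchesFully("[0-9]{3}-[0-9]{4}", "123-4567")); // true

        // {n,m} : at least n and at most m times
        System.out.println(matchesFully("ba{2,3}b", "baab"));    // true
        System.out.println(matchesFully("ba{2,3}b", "bab"));     // false
        System.out.println(matchesFully("ba{2,3}b", "baaaab"));  // false
    }
}
```

Note that this checks full matches, so it is stricter than "will find": tre+ finds a match inside tread, but tread as a whole does not match tre+.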
#### Brackets, Ranges and Negation

`[ ]` Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
`-` The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).
NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
`^` The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.
NOTE: Spaces, or in this case the lack of them, between ranges are very important.
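
The bracket rules can be tried out one character at a time. This is a small sketch, assuming a made-up class `BracketDemo` with a helper `matchesFully` built on `Pattern.matches`; each input is a single character so that one bracket expression covers the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class BracketDemo {

    // True only when the WHOLE input matches the regex.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A list: one position, either 1 or 2
        System.out.println(matchesFully("[12]", "2"));      // true
        System.out.println(matchesFully("[12]", "3"));      // false

        // Two ranges in one list: 0 to 9 and A to C, but not a to c
        System.out.println(matchesFully("[0-9A-C]", "B"));  // true
        System.out.println(matchesFully("[0-9A-C]", "b"));  // false

        // A leading dash is a literal dash
        System.out.println(matchesFully("[-0-9]", "-"));    // true

        // Negation with ^ inside the brackets
        System.out.println(matchesFully("[^Ff]", "F"));     // false
        System.out.println(matchesFully("[^Ff]", "x"));     // true
        System.out.println(matchesFully("[^a-z]", "Q"));    // true
        System.out.println(matchesFully("[^a-z]", "q"));    // false
    }
}
```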

#### Positioning

`^` The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string. For example, if the target string starts with Mozilla and mentions Windows later on, ^Win will not find Windows but ^Moz will find Mozilla.
`$` The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
`.` The . (period) means any single character in this position, for example, ton. will find tons, tone and tonneau but not wanton because it has no following character.
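
A rough Java sketch of the positioning rules follows. The class `AnchorDemo`, the helper `found` and the sample string "Mozilla on Windows" are assumptions made up for this example. The helper uses `Matcher.find()`, which searches anywhere in the input, so the anchors are what restrict the position.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name, helper and sample strings are illustrative only.
class AnchorDemo {

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String target = "Mozilla on Windows"; // made-up target string

        // ^ : only at the beginning of the target string
        System.out.println(found("^Moz", target)); // true
        System.out.println(found("^Win", target)); // false
        System.out.println(found("Win", target));  // true (no anchor)

        // $ : only at the end of the target string
        System.out.println(found("fox$", "silver fox"));                   // true
        System.out.println(found("fox$", "the fox jumped over the moon")); // false

        // . : any single character in this position
        System.out.println(found("ton.", "tons"));    // true
        System.out.println(found("ton.", "tonneau")); // true
        System.out.println(found("ton.", "wanton"));  // false
    }
}
```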

#### More...

`( )` The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together.
`|` The | (vertical bar or pipe) is called alternation in techspeak and means find the left-hand OR the right-hand value, for example, gr(a|e)y will find 'gray' or 'grey'.
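
In Java, parentheses also capture what they matched, which can be read back with `Matcher.group(int)`. The sketch below assumes a made-up class `GroupDemo` and helper `capturedVowel`; it reuses the gr(a|e)y pattern from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class GroupDemo {

    // Returns the letter matched by (a|e), or null when the input
    // contains neither "gray" nor "grey".
    static String capturedVowel(String input) {
        Matcher m = Pattern.compile("gr(a|e)y").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capturedVowel("a gray cat")); // a
        System.out.println(capturedVowel("a grey cat")); // e
        System.out.println(capturedVowel("a groy cat")); // null
    }
}
```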

Characters
x The character x
`\\` The backslash character
`\0n` The character with octal value 0n (0 <= n <= 7)
`\0nn` The character with octal value 0nn (0 <= n <= 7)
`\0mnn` The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
`\xhh` The character with hexadecimal value 0xhh
`\uhhhh` The character with hexadecimal value 0xhhhh
`\t` The tab character (`'\u0009'`)
`\n` The newline (line feed) character (`'\u000A'`)
`\r` The carriage-return character (`'\u000D'`)
`\f` The form-feed character (`'\u000C'`)
`\a` The alert (bell) character (`'\u0007'`)
`\e` The escape character (`'\u001B'`)
`\cx` The control character corresponding to x
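
One thing to keep in mind when using these escapes from Java: inside a Java string literal the backslash has to be doubled, so that the regex engine receives a single backslash. A small sketch, assuming a made-up class `EscapeDemo` with a helper `matchesFully`:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class EscapeDemo {

    // True only when the WHOLE input matches the regex.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hexadecimal 41 and octal 101 are both the letter A
        System.out.println(matchesFully("\\x41", "A"));   // true
        System.out.println(matchesFully("\\0101", "A"));  // true

        // Four-digit hexadecimal escape, resolved by the regex engine
        System.out.println(matchesFully("\\u0041", "A")); // true

        // The regex tab escape matches a real tab character
        System.out.println(matchesFully("\\t", "\t"));    // true

        // A literal backslash: four backslashes in Java source,
        // two in the regex, one in the matched input
        System.out.println(matchesFully("\\\\", "\\"));   // true
    }
}
```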

Predefined Character Classes
`.` Any character (may or may not match line terminators)
`\d` A digit: `[0-9]`
`\D` A non-digit: `[^0-9]`
`\s` A whitespace character: `[ \t\n\x0B\f\r]`
`\S` A non-whitespace character: `[^\s]`
`\w` A word character: `[a-zA-Z_0-9]`
`\W` A non-word character: `[^\w]`
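
The predefined classes work the same way as bracket expressions, only shorter. Here is a sketch with a made-up class `ClassDemo`; the helper names `matchesFully` and `collapseWhitespace` are also just for illustration. `String.replaceAll` takes a regex as its first argument.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class ClassDemo {

    // True only when the WHOLE input matches the regex.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Replaces every run of whitespace with a single space.
    static String collapseWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Digits and non-digits
        System.out.println(matchesFully("\\d+", "2011"));    // true
        System.out.println(matchesFully("\\d+", "20x1"));    // false
        System.out.println(matchesFully("\\D+", "March"));   // true

        // Word characters: letters, digits and underscore
        System.out.println(matchesFully("\\w+", "my_var1")); // true
        System.out.println(matchesFully("\\w+", "my-var"));  // false

        // Whitespace
        System.out.println(collapseWhitespace("a  b\tc"));   // a b c
    }
}
```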

I used this site as a reference: http://www.zytrax.com/tech/web/regex.htm. They have a cool Regular Expression tester: http://www.zytrax.com/tech/web/regex.htm#parenthesis

Java code that uses Pattern and Matcher:

.....
import java.util.regex.Matcher;
import java.util.regex.Pattern;
....
String value = "mysortofString23%%%";
Pattern p = Pattern.compile("[a-zA-Z]+"); // one or more ASCII letters
Matcher m = p.matcher(value);
boolean result = m.find(); // find() looks for the pattern anywhere in the input
// result is true, because "mysortofString" is a run of letters
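
The snippet above uses `find()`, which succeeds if any part of the input matches. `Matcher.matches()` is stricter and requires the whole input to match. The sketch below shows the difference and also how to loop over every match; the class `FindVsMatches` and the helper `letterRuns` are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class FindVsMatches {

    private static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+");

    // Collects every run of ASCII letters found in the input, in order.
    static List<String> letterRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = LETTERS.matcher(input);
        while (m.find()) {          // each call moves on to the next match
            runs.add(m.group());    // the text of the current match
        }
        return runs;
    }

    public static void main(String[] args) {
        String value = "mysortofString23%%%";

        // find(): some part of the input matches
        System.out.println(LETTERS.matcher(value).find());    // true

        // matches(): the entire input must match
        System.out.println(LETTERS.matcher(value).matches()); // false

        System.out.println(letterRuns(value));                // [mysortofString]
        System.out.println(letterRuns("abc123def"));          // [abc, def]
    }
}
```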

Examples for Regex patterns.

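^[0-9]+$    Matches a string made up of digits only (without the trailing $, ^[0-9]+ only requires the string to START with digits)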
......
(I will add more as I use them more.)
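
For the digits-only case, here is a quick sketch of why the end anchor matters when the pattern is used with `find()`. The class `DigitsOnlyDemo` and the helper `found` are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class DigitsOnlyDemo {

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without $, leading digits are enough
        System.out.println(found("^[0-9]+", "123abc"));  // true

        // With both anchors, every character has to be a digit
        System.out.println(found("^[0-9]+$", "123abc")); // false
        System.out.println(found("^[0-9]+$", "123"));    // true
        System.out.println(found("^[0-9]+$", ""));       // false
    }
}
```

Calling `matches()` instead of `find()` would give the same whole-string behaviour for "123abc" and "123" even without the anchors.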