Developer Tips: Regex, Pattern and Matcher

A quick reference to the regex specifics you need to know, followed by how to use them from Java with Pattern and Matcher.

Metacharacter	Meaning

Iteration
?	The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).
*	The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).
+	The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).
{n}	Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a literal. Value is enclosed in braces (curly brackets).
{n,m}	Matches the preceding character at least n times but not more than m times, for example, 'ba{2,3}b' will find 'baab' and 'baaab' but NOT 'bab' or 'baaaab'. Values are enclosed in braces (curly brackets).
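The iteration metacharacters above can be tried out from Java with a few lines. This is a minimal sketch: the class name QuantifierDemo and the helper found are made up for illustration, and find() is used so that a pattern matches anywhere inside the input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class QuantifierDemo {
    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("colou?r", "color"));                   // true: 'u' appears 0 times
        System.out.println(found("colou?r", "colour"));                  // true: 'u' appears 1 time
        System.out.println(found("tre*", "trough"));                     // true: 'e' appears 0 times
        System.out.println(found("tre+", "trough"));                     // false: + needs at least one 'e'
        System.out.println(found("[0-9]{3}-[0-9]{4}", "call 123-4567")); // true
        System.out.println(found("ba{2,3}b", "bab"));                    // false: only one 'a'
        System.out.println(found("ba{2,3}b", "baaab"));                  // true: three 'a'
        System.out.println(found("ba{2,3}b", "baaaab"));                 // false: four 'a'
    }
}
```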

Brackets, Ranges and Negation
[ ]	Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.
-	The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c). NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^	The ^ (circumflex or caret) as the first character inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z. NOTE: Spaces, or in this case the lack of them, between ranges are very important, because a space inside the brackets is itself a character to match.
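The bracket rules above can be checked the same way. Again a sketch only: BracketDemo and found are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class BracketDemo {
    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[12]", "2"));      // true: 2 is one of the listed characters
        System.out.println(found("[0-9A-C]", "B"));  // true: B is in the range A to C
        System.out.println(found("[0-9A-C]", "b"));  // false: ranges are case sensitive
        System.out.println(found("[-0-9]", "-"));    // true: a leading dash is a literal
        System.out.println(found("[^Ff]", "F"));     // false: F is excluded
        System.out.println(found("[^Ff]", "Fx"));    // true: x is not F or f
        System.out.println(found("[^a-z]", "abc"));  // false: every character is a to z
    }
}
```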

Positioning
^	The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, in a target string such as 'Mozilla/4.0 (compatible; Windows NT)', ^Win will not find Windows but ^Moz will find Mozilla.
$	The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
.	The . (period) means any single character in this position, for example, ton. will find tons, tone and the 'tonn' in tonneau, but not wanton because no character follows 'ton'.
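A short sketch of the positioning metacharacters in Java. The target string used for the ^ examples is made up for illustration, as are the names PositionDemo and found.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name, helper and sample strings are illustrative only.
class PositionDemo {
    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String target = "Mozilla/4.0 (compatible; Windows NT)";
        System.out.println(found("^Moz", target));   // true: the string starts with Moz
        System.out.println(found("^Win", target));   // false: Windows is not at the start
        System.out.println(found("Win", target));    // true: unanchored, found in the middle
        System.out.println(found("fox$", "silver fox"));                   // true
        System.out.println(found("fox$", "the fox jumped over the moon")); // false
        System.out.println(found("ton.", "tons"));    // true: 's' fills the . position
        System.out.println(found("ton.", "wanton"));  // false: nothing follows 'ton'
    }
}
```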

More...
()	The ( (open parenthesis) and ) (close parenthesis) may be used to group (or bind) parts of our search expression together.
|	The | (vertical bar or pipe) is called alternation in techspeak and means find the left hand OR right values, for example, gr(a|e)y will find 'gray' or 'grey'.
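In Java, parentheses also capture the text they matched, which Matcher exposes through group(1). The sketch below shows that, and why the grouping matters for alternation. GroupDemo and firstGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class GroupDemo {
    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("gr(a|e)y", "a grey cat"));   // e
        System.out.println(firstGroup("gr(a|e)y", "a gray cat"));   // a
        System.out.println(firstGroup("gr(a|e)y", "a green cat"));  // null
        // Pattern.matches requires the whole input to match.
        System.out.println(Pattern.matches("gr(a|e)y", "gray"));    // true
        // Without parentheses the alternatives are "gra" and "ey", not "gray" and "grey".
        System.out.println(Pattern.matches("gra|ey", "gray"));      // false
    }
}
```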

Characters
x	The character x
`\\`	The backslash character
`\0`n	The character with octal value `0`n (0 <= n <= 7)
`\0`nn	The character with octal value `0`nn (0 <= n <= 7)
`\0`mnn	The character with octal value `0`mnn (0 <= m <= 3, 0 <= n <= 7)
`\x`hh	The character with hexadecimal value `0x`hh
`\u`hhhh	The character with hexadecimal value `0x`hhhh
`\t`	The tab character (`'\u0009'`)
`\n`	The newline (line feed) character (`'\u000A'`)
`\r`	The carriage-return character (`'\u000D'`)
`\f`	The form-feed character (`'\u000C'`)
`\a`	The alert (bell) character (`'\u0007'`)
`\e`	The escape character (`'\u001B'`)
`\c`x	The control character corresponding to x
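One thing to watch when using these escapes from Java: the backslash is also the escape character in Java string literals, so each regex backslash has to be doubled in source code. A minimal sketch, with EscapeDemo and whole as made-up names:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class EscapeDemo {
    // True if the regex matches the entire input.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "\\\\" is the two-character regex that matches one backslash.
        System.out.println(whole("\\\\", "\\"));    // true
        // The Java literal "\\t" is the regex tab escape; the input "\t" is a real tab.
        System.out.println(whole("\\t", "\t"));     // true
        System.out.println(whole("\\x41", "A"));    // true: hexadecimal 41 is 'A'
        System.out.println(whole("\\0101", "A"));   // true: octal 101 is 'A'
        System.out.println(whole("\\u0041", "A"));  // true: four hex digits, resolved by the regex engine
    }
}
```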

Predefined Character Classes
`.`	Any character (may or may not match line terminators)
`\d`	A digit: `[0-9]`
`\D`	A non-digit: `[^0-9]`
`\s`	A whitespace character: `[ \t\n\x0B\f\r]`
`\S`	A non-whitespace character: `[^\s]`
`\w`	A word character: `[a-zA-Z_0-9]`
`\W`	A non-word character: `[^\w]`
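The predefined classes are handy with the String methods that accept a regex as well as with Pattern. A small sketch; ClassDemo and firstMatch are hypothetical names and the sample strings are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class ClassDemo {
    // Returns the first substring that matches the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\d+", "room 42"));           // 42
        System.out.println(firstMatch("\\w+", "  user_name9!"));     // user_name9
        System.out.println(firstMatch("\\S+", "  hi there"));        // hi
        // \D removes everything that is not a digit.
        System.out.println("2011-03-11".replaceAll("\\D", ""));      // 20110311
        // \s+ splits on runs of spaces, tabs and other whitespace.
        System.out.println("a b\tc".split("\\s+").length);           // 3
    }
}
```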

I used this site as a reference: http://www.zytrax.com/tech/web/regex.htm. They have a cool regular expression tester: http://www.zytrax.com/tech/web/regex.htm#parenthesis

Java Code to use Pattern and Matcher.

.....
import java.util.regex.Matcher;
import java.util.regex.Pattern;
....
                String value = "mysortofString23%%%";
                // [a-zA-Z]+ matches a run of one or more ASCII letters
                Pattern p = Pattern.compile("[a-zA-Z]+");
                Matcher m = p.matcher(value);
                // find() returns true if the pattern occurs anywhere in value
                boolean result = m.find();
                // result is true: "mysortofString" matches
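Worth noting: find() looks for the pattern anywhere in the input, while matches() requires the whole input to match, and calling find() repeatedly walks through every match. The following is a self-contained sketch of both points; FindVersusMatches and allMatches are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class FindVersusMatches {
    // Collects every non-overlapping match of the regex in the input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String value = "mysortofString23%%%";
        Pattern p = Pattern.compile("[a-zA-Z]+");
        System.out.println(p.matcher(value).find());     // true: letters occur somewhere
        System.out.println(p.matcher(value).matches());  // false: digits and % are not letters
        System.out.println(allMatches("[a-zA-Z]+", value));          // [mysortofString]
        System.out.println(allMatches("[a-zA-Z]+", "abc 123 def"));  // [abc, def]
    }
}
```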

Examples for Regex patterns.

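^[0-9]+$ Matches a string made up of digits only. Without the trailing $, ^[0-9]+ only checks that the string starts with one or more digits.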
......
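A quick way to see how much the end anchor matters for a "digits only" check is to run the pattern with and without $ against the same inputs. This is only a sketch; DigitsDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class DigitsDemo {
    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Anchored at the start only: a leading run of digits is enough.
        System.out.println(found("^[0-9]+", "12345"));    // true
        System.out.println(found("^[0-9]+", "123abc"));   // true
        System.out.println(found("^[0-9]+", "abc123"));   // false
        // Anchored at both ends: nothing but digits is allowed.
        System.out.println(found("^[0-9]+$", "12345"));   // true
        System.out.println(found("^[0-9]+$", "123abc"));  // false
    }
}
```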

(I will add more as I use them more.)

Friday, March 11, 2011
