Text processing frequently calls for regular expressions: searching, parsing, validation, or checking XML document integrity. Java provides the java.util.regex package to make working with regular expressions easier. Below I have summarized the things I need most often, since I use Java regex so frequently 🙂

Common matching symbols:

Regular Expression Description
. Matches any character (by default, any character except a line terminator)
^regex regex must match at the beginning of the line (the beginning of the whole input unless Pattern.MULTILINE is set)
regex$ regex must match at the end of the line (the end of the whole input unless Pattern.MULTILINE is set)
[abc] Set definition, matches the letter a or b or c
[abc[vz]] Union of sets, matches a single character that is a or b or c or v or z (same as [abcvz])
[^abc] A “^” as the first character inside [] negates the set. This matches any character except a or b or c
[a-d1-7] Ranges, matches one letter between a and d or one digit from 1 to 7; it does not match the two-character sequence d1
X|Z Finds X or Z
XZ Finds X directly followed by Z
$ Checks whether the end of the line follows
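
To make the set rules above concrete, here is a minimal sketch. The class name CharClassDemo and the helper fullMatch are my own names for illustration; Pattern.matches is the standard method that tests whether the whole input matches.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[abc[vz]]", "v"));         // true: nested sets form a union
        System.out.println(fullMatch("[abc[vz]]", "av"));        // false: a set matches one character
        System.out.println(fullMatch("[^abc]", "d"));            // true: d is not a, b or c
        System.out.println(fullMatch("[a-d1-7]", "d1"));         // false: two characters, one set
        System.out.println(fullMatch("[a-d1-7][a-d1-7]", "d1")); // true: one set per character
        System.out.println(fullMatch("X|Z", "Z"));               // true: alternation
    }
}
```

A set always consumes exactly one character, which is why d1 needs two sets in a row.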

Metacharacters:

Regular Expression Description
\d Any digit, short for [0-9]
\D A non-digit, short for [^0-9]
\s A whitespace character, short for [ \t\n\x0b\r\f]
\S A non-whitespace character, short for [^\s]
\w A word character, short for [a-zA-Z_0-9]
\W A non-word character, short for [^\w]
\S+ One or more non-whitespace characters
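
One thing that trips people up: inside a Java string literal every backslash has to be doubled, so \d is written "\\d". A small sketch follows, where the class PredefinedClassDemo, the helper firstMatch and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "Order 66 shipped";
        System.out.println(firstMatch("\\d+", input));       // 66
        System.out.println(firstMatch("\\w+", input));       // Order
        System.out.println(firstMatch("\\S+\\s\\d", input)); // Order 6
        System.out.println(firstMatch("\\d+", "no digits")); // null
    }
}
```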

Characters:

Characters Description
x The character x
\\ The backslash character
\0n The character with octal value 0n (0<=n<=7)
\0nn The character with octal value 0nn (0<=n<=7)
\0mnn The character with octal value 0mnn (0<=m<=3, 0<=n<=7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
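
The escapes above can be checked quickly with a sketch like the following. EscapeDemo and fullMatch are hypothetical names; the letter A (decimal 65, hex 41, octal 101) and the tab character are just convenient test values.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\x41", "A"));   // true: hexadecimal escape
        System.out.println(fullMatch("\\u0041", "A")); // true: four-digit hexadecimal escape
        System.out.println(fullMatch("\\0101", "A"));  // true: octal escape
        System.out.println(fullMatch("a\\tb", "a\tb")); // true: tab escape
        System.out.println(fullMatch("\\cI", "\t"));   // true: control-I is the tab character
        System.out.println(fullMatch("\\\\", "\\"));   // true: one literal backslash
    }
}
```

Note that the regex for a single backslash needs four backslashes in Java source: two for the regex engine, each doubled again for the string literal.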

Quantifier:

Regular Expression Description Examples
* Occurs zero or more times, short for {0,} X* – Finds zero or more letters X, .* – any character sequence
+ Occurs one or more times, short for {1,} X+ – Finds one or more letters X
? Occurs zero or one time, short for {0,1} X? – Finds zero or exactly one letter X
{X} Occurs exactly X times; {} gives the repetition count of the preceding element \d{3} – Three digits, .{10} – any character sequence of length 10
{X,Y} Occurs between X and Y times \d{1,4} – \d must occur at least once and at most four times
*? A ? after a quantifier makes it a “reluctant quantifier”, which tries to find the smallest match.
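
The difference between a greedy and a reluctant quantifier is easiest to see side by side. This is only a sketch; QuantifierDemo, firstMatch and the sample inputs are assumptions of mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));          // <b>bold</b> (greedy)
        System.out.println(firstMatch("<.+?>", html));         // <b> (reluctant)
        System.out.println(firstMatch("X*", "XXXY"));          // XXX
        System.out.println(firstMatch("\\d{3}", "12345"));     // 123
        System.out.println(firstMatch("\\d{1,4}", "12345"));   // 1234
    }
}
```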

A simple example of case-insensitive URL matching using Java regex is given below:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("(\\s*|^)((ht|f)tp(s?)://)?([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&=]*)?(\\s*|$)", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(content); // content is the text to search

while (m.find()) {
    // do the required action, for example:
    // String matchString = m.group();
}

To match URLs that include a port number, pass the following pattern to Pattern.compile instead:

(\\s*|^)(((ht|f)tp(s?)://(www)?)|www)((?:[a-z0-9.-]|%[0-9A-F]{2}){3,})(?::(\\d+))?((?:\\/(?:[a-z0-9-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\\?((?:[a-z0-9-._~!$&'()*+,;=:\\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9-._~!$&'()*+,;=:\\/?@]|%[0-9A-F]{2})*))?(\\s*|$)
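
Here is one way the find loop could be wrapped into a runnable program, using the first (shorter) URL pattern. The class UrlFinderDemo, the method findUrls and the sample sentence are invented for this sketch. Because the pattern begins and ends with an optional whitespace group, the matched text can carry surrounding spaces, so I trim it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFinderDemo {
    // The first URL pattern from the text, compiled once and reused.
    private static final Pattern URL = Pattern.compile(
        "(\\s*|^)((ht|f)tp(s?)://)?([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&=]*)?(\\s*|$)",
        Pattern.CASE_INSENSITIVE);

    // Collects every match in content, with surrounding whitespace removed.
    static List<String> findUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(content);
        while (m.find()) {
            urls.add(m.group().trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String content = "Visit HTTP://Example.com/docs and ftp://files.example.org today";
        System.out.println(findUrls(content));
        // prints [HTTP://Example.com/docs, ftp://files.example.org]
    }
}
```

Keep in mind that the scheme part is optional in this pattern, so any dotted token (for example a decimal number) would also be reported as a match.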

We can use the group method of Matcher to retrieve the matched text:

A group is a pair of parentheses used to group subpatterns. For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example,

input: abbc

pattern: a(b*)c

causes the substring bb to be captured by the group (b*). A pattern can have more than one group and the groups can be nested. For example,

pattern: (a(b*))+(c*)

contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c*)

The groups are numbered by the position of their opening parenthesis, from left to right, so an outer group comes before the groups nested inside it. There is an implicit group 0, which contains the entire match. When a group is repeated by a quantifier, only its most recent match is captured: on the input abbabcd, for example, group 1 of the pattern above is applied twice, first to abb and then to ab, and afterwards holds ab. Note also that when * is applied to a group and the group matches zero times in a later repetition, the group is not cleared; it still holds the most recently captured text. For example,

input: aba

pattern: (a(b)*)+

group 0: aba
group 1: a
group 2: b

Group 1 first matched ab capturing b in group 2. Group 1 then matched the a with group 2 matching zero bs, therefore leaving intact the previously captured b.
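
The behaviour described above can be reproduced with a short sketch. RepeatedGroupDemo and groupsOf are hypothetical names of mine; Matcher.matches, groupCount and group are the standard java.util.regex methods.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Returns groups 0..groupCount() if the whole input matches, otherwise null.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Group 2 keeps the b captured in the first repetition of group 1.
        System.out.println(Arrays.toString(groupsOf("(a(b)*)+", "aba")));
        // prints [aba, a, b]
    }
}
```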

Note: If a group does not need to capture text, use a non-capturing group, written (?:...), since it is more efficient. The following example demonstrates how to retrieve the text in each group.

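import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExample {
    public static void main(String[] args) {
        CharSequence inputStr = "abbabcd";
        String patternStr = "(a(b*))+(c*)";

        // Compile and use regular expression
        Pattern pattern = Pattern.compile(patternStr);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            // Get all groups for this match; group 0 is the entire match
            for (int i = 0; i <= matcher.groupCount(); i++) {
                String groupStr = matcher.group(i);
                // Prints abbabc, ab, b and c for groups 0 to 3
                System.out.println("group " + i + ": " + groupStr);
            }
        }
    }
}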
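
For comparison, here is a sketch of the same pattern with the outer group made non-capturing, as suggested in the note earlier. NonCapturingDemo and its two helpers are names I made up; the point is only that (?:...) groups without taking a group number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text of the given group for the first match in input, or null if nothing matches.
    static String groupOfFirstMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a(b*))+(c*)"));   // 3
        System.out.println(groupCount("(?:a(b*))+(c*)")); // 2
        // With the outer group non-capturing, (b*) becomes group 1 and (c*) group 2.
        System.out.println(groupOfFirstMatch("(?:a(b*))+(c*)", "abbabcd", 1)); // b
        System.out.println(groupOfFirstMatch("(?:a(b*))+(c*)", "abbabcd", 2)); // c
    }
}
```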

Sources: www.vogella.de, www.exampledepot.com, java.sun.com

Regular Expression Pocket Reference: Regular Expressions for Perl, Ruby, PHP, Python, C, Java and .NET (Pocket Reference (O’Reilly))