Understanding Regex 101 - An Introduction to Regex and its Basic Methods/Symbols

Rachel Kim Sep 28, 2019 7:57:00 AM

Intro to Regular Expressions/Regex

As a software developer, you’ve probably encountered regular expressions several times and been confused by a daunting string of characters like this:

/^\w+([.-]?\w)+@\w+([.]?\w)+(.[a-zA-Z]{2,3})+$/

And you may have wondered what this gibberish means…

Regular expressions (Regex or Regexp) are extremely useful in stepping up your algorithm game and will make you a better problem solver. The structure of regular expressions can be intimidating at first, but it is so rewarding once you grasp the patterns and implement them in your work properly.

What is Regex?

A Regular Expression, commonly referred to as Regex, is a powerful tool used for searching, validating, and manipulating text strings. It is essentially a sequence of characters that defines a search pattern. Regex is supported in numerous programming languages, including scripting languages like Perl, Python, PHP, and JavaScript, as well as general-purpose languages such as Java. Even word processors such as Microsoft Word offer a regex-like wildcard search for advanced text searching. The true strength of regex lies in its ability to perform complex pattern matching and text manipulation tasks with concise expressions, often replacing dozens of lines of traditional programming code.

Regex Syntax Basics

Regex syntax consists of a sequence of characters, metacharacters, and quantifiers.

Metacharacters

Metacharacters are special characters that define specific operations or behaviors in regex. Some commonly used metacharacters include:

The dot (.) matches any character except line breaks.
The caret (^) matches the start of a string or line (in multiline mode).
The dollar sign ($) matches the end of a string or line (in multiline mode).
The asterisk (*) matches zero or more occurrences of the preceding element.
The plus sign (+) matches one or more occurrences of the preceding element.
The question mark (?) matches zero or one occurrence of the preceding element.
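
As a minimal sketch of the list above, each line below exercises one metacharacter. The sample strings are made up for illustration.

```javascript
// One small pattern per metacharacter from the list above.
const dot = /h.t/.test('hot');               // . stands in for the 'o'
const start = /^Hello/.test('Hello, world'); // ^ anchors to the start
const end = /world$/.test('Hello, world');   // $ anchors to the end
const star = /ab*c/.test('ac');              // * allows zero 'b's
const plus = /ab+c/.test('ac');              // + needs at least one 'b'
const optional = /colou?r/.test('color');    // ? makes the 'u' optional

console.log(dot, start, end, star, plus, optional);
// true true true true false true
```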

Character Classes

A character class is a set of characters that can be matched by a regex pattern. Character classes are defined using square brackets [] and can contain a list of characters, a range of characters, or a combination of both.

For example:

[a-z] matches any lowercase letter from 'a' to 'z'.
[A-Z] matches any uppercase letter from 'A' to 'Z'.
[0-9] matches any digit from '0' to '9'.

Character classes can also be negated by using the caret symbol ^ at the beginning of the class. For instance:

[^a-zA-Z] matches any character that is not a letter (i.e., neither lowercase nor uppercase).
[^0-9] matches any character that is not a digit.

This flexibility allows you to create specific and targeted search patterns to suit various use cases.
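
Here is a small sketch of character classes in JavaScript, assuming a couple of made-up inputs. A class always matches one character at a time.

```javascript
// A character class matches a single character from the set.
const isLowercase = /[a-z]/.test('q'); // 'q' is in a-z
const isDigit = /[0-9]/.test('q');     // 'q' is not a digit

// A negated class finds the first character outside the set.
const nonLetter = 'R2-D2'.match(/[^a-zA-Z]/);

console.log(isLowercase);  // true
console.log(isDigit);      // false
console.log(nonLetter[0]); // '2'
```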

Word Characters & Non-Word Characters

A word character in regex includes letters, digits, and underscores (_). These characters are matched using the shorthand \w. For example, \w+ matches one or more word characters, which can be helpful for finding words, variable names, or identifiers in text.

The opposite of a word character is a non-word character, which can be matched using the shorthand \W. For example, \W matches any character that is not a letter, digit, or underscore, such as punctuation or spaces.

This distinction between word and non-word characters is crucial for creating accurate and efficient regex patterns.
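
To make the difference concrete, here is a short sketch that splits a made-up line of code into word and non-word characters. The g flag (covered later in this article) is used so that every match is returned.

```javascript
const text = 'user_name = 42;';

// \w+ grabs runs of letters, digits, and underscores.
const words = text.match(/\w+/g);
// \W grabs each character that is not a letter, digit, or underscore.
const others = text.match(/\W/g);

console.log(words);  // [ 'user_name', '42' ]
console.log(others); // [ ' ', '=', ' ', ';' ]
```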

Quantifiers and Groups

Quantifiers are used to specify how many times a pattern should appear within the text being matched. The most common quantifiers are:

* (zero or more occurrences)
+ (one or more occurrences)
? (zero or one occurrence)

These quantifiers allow you to control the repetition of patterns in your regex, which is essential when you need to match repeating sequences or optional elements.

Groups are used to capture parts of a match for reuse or extraction. Groups are defined by enclosing a pattern in parentheses (). For example, (abc) captures the string 'abc', which can then be referenced in the same regex using \1 (this refers to the first captured group).

There are two types of groups:

Unnamed Groups: The basic parentheses (abc) will capture without a name.
Named Groups: Use the syntax (?<name>pattern) to capture a group with a name, such as (?<id>\d+), making the regex more readable and the captured content more accessible.

This capability to group and reference parts of a match adds a powerful layer of flexibility to your regex patterns, especially in complex matching scenarios.
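
The sketch below shows both ideas in JavaScript: a backreference with \1 and a named group read through match.groups. The input strings are invented for the example.

```javascript
// \1 must repeat whatever the first group captured,
// so this pattern detects a word typed twice in a row.
const repeated = /(\w+) \1/.test('it is is done');

// A named group exposes its capture under match.groups.
const match = 'order id: 1234'.match(/id: (?<id>\d+)/);

console.log(repeated);        // true
console.log(match.groups.id); // '1234'
console.log(match[1]);        // '1234' (the same capture, by position)
```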

Flags

Flags modify the behavior of a regular expression and can be added after the closing slash or as the second parameter in the RegExp constructor. Here are the most commonly used flags:

g (global): The g flag makes the regex search for all occurrences of the pattern, not just the first one.
i (case-insensitive): The i flag makes the regex match letters regardless of their case, so a matches both 'a' and 'A'.
m (multiline): The m flag changes the behavior of the ^ and $ anchors to match the start and end of each line within the string, instead of the start and end of the entire string.

Example of using flags:

const regex = /pattern/gim;

In this example, the regex will:

Find all matches (g),
Ignore case differences (i),
Treat the string as multiple lines for ^ and $ matches (m).

How to Create a Regular Expression

There are two ways to create a regular expression:

1. regexLiteral

To create a regular expression literal, you start and end with forward slashes (/) to enclose the Regex pattern. Syntax:

/regex pattern/flags


2. RegExp

The RegExp constructor builds the expression for you from a pattern string and an optional flags string. Syntax:

new RegExp(regex pattern[, flags])

When to Use regexLiteral vs. RegExp

If your regular expression is constant and never changes, use a regex literal: it is compiled when the script loads, which gives better performance. If the pattern is dynamic, meaning it is built at runtime from a variable or other expression rather than written out literally, use the RegExp constructor.
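
A minimal sketch of the two approaches side by side. The variable searchTerm is a hypothetical stand-in for a value that is only known at runtime, such as user input.

```javascript
// Literal: the pattern is fixed when the script is written.
const literal = /cat/i;

// Constructor: the pattern is assembled from a runtime value.
// searchTerm is a hypothetical stand-in for user input.
const searchTerm = 'dog';
const dynamic = new RegExp(searchTerm, 'i');

console.log(literal.test('Cat nap')); // true
console.log(dynamic.test('Hot DOG')); // true
console.log(dynamic.source);          // 'dog'
```

If the runtime value can contain regex metacharacters, they need escaping before being passed to the constructor.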

How to Use Regular Expression Methods

There are three common Regex methods that you should be familiar with: test, match, and replace.

Regex Test Method

The .test method returns a boolean: true if the string contains a match for the search pattern, and false if it does not.
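
A small sketch of the test method, using made-up sentences:

```javascript
const pattern = /regex/;

// test returns true when the pattern is found anywhere in the string.
const hasMatch = pattern.test('I am learning regex');
const noMatch = pattern.test('I am learning CSS');

console.log(hasMatch); // true
console.log(noMatch);  // false
```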

Regex Match Method

Instead of RegExp.test(String), which only returns a boolean telling you whether the pattern matched, you can use the .match method to retrieve the match itself. While .test is great for checking whether a string matches a pattern, there will be times when you want the matched text. That’s where the match method comes in handy! It returns an array containing the matched string (or null when nothing matches), which can be helpful information depending on your use case.

Later on, you will see how match becomes a powerful tool when the regex is combined with flags.
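
A very basic sketch of match without any flags, using a made-up sentence. The returned array also carries the position of the match in its index property.

```javascript
const sentence = 'I am learning regex';

// With a match, the array holds the matched text plus some metadata.
const result = sentence.match(/learn/);
// Without a match, the method returns null instead of an array.
const missing = sentence.match(/python/);

console.log(result[0]);    // 'learn'
console.log(result.index); // 5
console.log(missing);      // null
```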

Regex Replace Method

The .replace method searches a string for a specified value (or regular expression) and returns a new string in which the specified value is replaced.

NOTE:

When the search value is a plain string, .replace only replaces the first occurrence. To replace every occurrence, use a regex with the g flag.
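
The sketch below contrasts a plain string value with a regex that uses the g flag. The sentence is invented for the example.

```javascript
const sentence = 'red car, red bike, red bus';

// A plain string value only replaces the first occurrence.
const withString = sentence.replace('red', 'blue');
// A regex with the g flag replaces every occurrence.
const withRegex = sentence.replace(/red/g, 'blue');

console.log(withString); // 'blue car, red bike, red bus'
console.log(withRegex);  // 'blue car, blue bike, blue bus'
```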

How to Use Bracket Expressions in Regex

Inside a bracket expression, you list the characters, or ranges of characters, that make up the character set. The expression matches any single character from that set.

For example, const regex = /[A-Z]/. Notice that A-Z is inside the square brackets. This will match any uppercase letter in the alphabet. Here are some similar search patterns:

[a-z] matches any single lowercase letter
[A-Z] matches any single uppercase letter
[abcd] matches a, b, c, or d
[a-d] exactly the same as the previous example, so you can either list each character or use a range
[a-gA-C0-7] matches a lowercase letter a-g, an uppercase letter A-C, or a digit 0-7
[^a-zA-Z] matches any character that is NOT a lowercase or uppercase letter

*At the start of a character set, the ^ character negates the set: here it means any character that is NOT in a-z or A-Z.
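
To see these sets in action, here is a sketch that runs several of them against one made-up string. The g flag (explained in the next section) is added so that every matching character is listed.

```javascript
const sample = 'Cab 42 dig';

const explicit = sample.match(/[abcd]/g);      // each character listed
const ranged = sample.match(/[a-d]/g);         // the same set as a range
const mixed = sample.match(/[a-gA-C0-7]/g);    // three ranges combined
const notLetters = sample.match(/[^a-zA-Z]/g); // negated set

console.log(explicit);   // [ 'a', 'b', 'd' ]
console.log(ranged);     // [ 'a', 'b', 'd' ]
console.log(mixed);      // [ 'C', 'a', 'b', '4', '2', 'd', 'g' ]
console.log(notLetters); // [ ' ', '4', '2', ' ' ]
```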

How to Use Flags with Regular Expressions

After the closing slash, we can add a single flag or combine several. Flags tell the regex more precisely how to find and match the pattern.

Before we go into the specific flags, keep in mind that flags are optional. For example, a regex such as /[A-Z]/ has none.

Without flags, the match method stops at the first match and returns it in an array. So for a sentence that begins with a capital T, our code will return ['T'] because it found the first uppercase letter in the sentence.
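
A sketch of the no-flag case. The sentence here is an assumption chosen to start with 'T'; any string would behave the same way.

```javascript
// Hypothetical sample sentence; only the first uppercase letter is returned.
const sentence = 'The Quick Brown Fox';
const regex = /[A-Z]/;
const found = sentence.match(regex);

console.log(found[0]);     // 'T'
console.log(found.length); // 1
```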

The g flag

The g in g flag stands for "global", which means the search covers the entire string. In other words, it does not stop after the first match but returns ALL the occurrences that matched.

If we added the g flag after the closing slash, the match would return every uppercase letter in the string.
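
The same idea as a sketch, again with a made-up sentence:

```javascript
// Hypothetical sample sentence with four uppercase letters.
const sentence = 'The Quick Brown Fox';

// With g, match returns every uppercase letter, not just the first.
const allUppercase = sentence.match(/[A-Z]/g);

console.log(allUppercase); // [ 'T', 'Q', 'B', 'F' ]
```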

The m flag

Let’s say we changed const to be const regex = /[a-z]/m. The m flag only changes how the ^ and $ anchors behave, and this pattern uses neither, so the result is the same as with no flag: the first lowercase letter from a-z, which is ['h'].
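
The m flag only shows its effect once an anchor is involved. Here is a sketch with a made-up two-line string, comparing ^ with and without m:

```javascript
// Hypothetical two-line input; \n separates the lines.
const lines = 'The first line\nhere is the second';

// No anchor in the pattern, so m makes no difference here.
const firstLower = lines.match(/[a-z]/m);
// Without m, ^ only matches at the very start of the string ('T').
const withoutM = lines.match(/^[a-z]/);
// With m, ^ also matches right after each line break.
const withM = lines.match(/^[a-z]/m);

console.log(firstLower[0]); // 'h' (from 'The')
console.log(withoutM);      // null
console.log(withM[0]);      // 'h' (from 'here')
console.log(withM.index);   // 15
```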

As an additional side note, there are three shorthand character classes that can help with pattern matching: \d matches any digit (same as [0-9]), \w matches any word character (same as [a-zA-Z0-9_]), and \s matches any whitespace character.

The negations of \d, \w, and \s are \D, \W, and \S. They match the following:

\D matches any non digit character (same as [^0-9])
\W matches any non word character (same as [^a-zA-Z0-9_])
\S matches a non whitespace character
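
A quick sketch of the three negated classes against one short, made-up string:

```javascript
const input = 'a1 _!';

const nonDigits = input.match(/\D/g); // everything except '1'
const nonWord = input.match(/\W/g);   // the space and the '!'
const nonSpace = input.match(/\S/g);  // everything except the space

console.log(nonDigits); // [ 'a', ' ', '_', '!' ]
console.log(nonWord);   // [ ' ', '!' ]
console.log(nonSpace);  // [ 'a', '1', '_', '!' ]
```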

How to Use Quantifiers In Regular Expression

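Quantifiers are symbols that specify how many times the preceding item may repeat. They are often used together with the anchors (^ and $) and the dot, which are listed here for reference.

* matches previous item zero or more times
+ matches previous item one or more times
? matches previous item zero or one time; makes preceding item optional
^ matches the beginning of the string (an anchor, not a quantifier)
$ matches the end of the string (an anchor, not a quantifier)
. matches any single character (except line breaks)
{min,max} matches previous item between min and max times; min is 0 or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. In JavaScript, write it without a space after the comma.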

Let’s go through an example to demonstrate our understanding of quantifiers. Take the regex /[a-z]+/g, run with match against a string containing the words for, if, rof, and fi.

The regular expression checks for lowercase letters from a-z, and the + symbol extends each match to cover one or more consecutive letters. So when you console log found, it will return [ 'for', 'if', 'rof', 'fi' ].

Let’s say the + symbol was not there and the Regex was only /[a-z]/g.

Then it will return [ 'f', 'o', 'r', 'i', 'f', 'r', 'o', 'f', 'f', 'i' ].
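
Both cases can be sketched as follows. The input string is an assumption chosen to reproduce the results described above, and the last line adds a {min,max} case for comparison.

```javascript
// Assumed input; it yields the results described above.
const text = 'for if rof fi';

// With +, consecutive letters are grouped into whole words.
const found = text.match(/[a-z]+/g);
// Without +, each letter is its own match.
const single = text.match(/[a-z]/g);
// {2,3} takes three 'a's first (greedy), then the remaining two.
const bounded = 'aaaaa'.match(/a{2,3}/g);

console.log(found);   // [ 'for', 'if', 'rof', 'fi' ]
console.log(single);  // [ 'f', 'o', 'r', 'i', 'f', 'r', 'o', 'f', 'f', 'i' ]
console.log(bounded); // [ 'aaa', 'aa' ]
```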

Use Case: Regex for Email Address Formatting

Remember this long string of characters we saw at the beginning of this article?

/^\w+([.-]?\w)+@\w+([.]?\w)+(.[a-zA-Z]{2,3})+$/

Now that we have learned the basic methods and terminologies used in Regex, let’s break down this once daunting but now understandable string of characters one step at a time.

First, let’s take a look at this Regex piece by piece. At the beginning of the string we have ^\w+. The ^ character anchors the match to the start of the string, and \w is the shorthand for a word character (a letter, digit, or underscore). The + quantifier matches one or more of them. From our example, this first piece matches the ‘student’ characters from the email: student-id@alumni.school.edu

Next, we have the second piece of the Regex: ([.-]?\w)+. The parentheses form the first capturing group, and inside it is a character set that matches either a “.” or a “-” in our email. The ? is a quantifier that matches the preceding character set zero or one time, so at most one “-” or “.” may appear before each word character matched by \w. There cannot be more than one of those characters consecutively in a valid email. The + after the group lets the group repeat one or more times. So this second piece represents the ‘-id’ characters from the email example. If it was ‘student--id@alumni.school.edu’ with two hyphens, this would come out to be an invalid email.

The third piece is @\w+. It checks for the @ character in the given email, followed by \w+ to match one or more word characters. This covers the ‘@alumni’ piece of the email.

The following piece, ([.]?\w)+, is the same search pattern as our second piece, except that it only allows the “.” character before each word character, excluding our “-” symbol. This represents “.school” in the email.

The next chunk, (.[a-zA-Z]{2,3})+, is a crucial piece in checking an email format. This piece is for the top-level domain (TLD) of an email address: the part of a domain that comes after the final dot, for example com, org, or net. It is meant to match a “.” character followed by a character set that accepts any lowercase or uppercase letter. Note that an unescaped . outside a character set matches any character, so the stricter way to require a literal dot is \. (or [.], as in the previous piece). The {2,3} matches the preceding character set between 2 and 3 times, where 2 is the minimum number of matches and 3 is the maximum. So the TLD must be 2 to 3 letters long. In this case, it is ‘.edu’.

Finally, we have the $ character, which anchors the match to the end of the string.

And that’s it! Now we know how to use Regex for a basic email validation. Additionally, you can use brackets, flags, and/or quantifiers in your Regex to accommodate other edge cases not considered in our Regex string.
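
Putting the pieces together, a basic check might look like the sketch below. Two things here are assumptions rather than part of the walkthrough: the helper name isValidEmail is hypothetical, and the dot before the TLD is escaped (\.) so that it only matches a literal dot.

```javascript
// Pattern assembled from the pieces discussed above.
// Assumption: the TLD dot is escaped (\.) to match a literal dot only.
const emailPattern = /^\w+([.-]?\w)+@\w+([.]?\w)+(\.[a-zA-Z]{2,3})+$/;

// isValidEmail is a hypothetical helper name used for illustration.
function isValidEmail(email) {
  return emailPattern.test(email);
}

console.log(isValidEmail('student-id@alumni.school.edu'));  // true
console.log(isValidEmail('student--id@alumni.school.edu')); // false (two hyphens)
console.log(isValidEmail('student-id.alumni.school.edu'));  // false (no @)
```

This is a teaching pattern, not a complete validator: for instance, it rejects TLDs longer than three letters.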

Understanding Regex: Next Steps

Regular expressions are an essential tool for developers, offering a powerful way to search, validate, and manipulate text efficiently. Whether you're performing input validation, searching for patterns in logs, or parsing complex data formats like dates and URLs, mastering regex will greatly enhance your problem-solving skills. As you continue to explore its capabilities, you'll find regex invaluable for automating repetitive tasks, filtering large datasets, and handling diverse text-processing challenges across various programming languages and environments. Regex isn't just about matching text—it's about streamlining tasks and writing cleaner, more efficient code.
