Regular expressions are one of the highest-leverage skills in a developer's toolkit. A 30-character regex can replace 100 lines of string-parsing code. They power search-and-replace in every text editor, input validation in every web form, log parsing in every analytics pipeline, and data extraction in every scraper.
They also have a reputation for being write-only — cryptic to read, hard to debug, and treacherous to modify. That reputation is partly deserved for complex patterns, but it dissolves once you understand the systematic structure behind the syntax.
This guide covers every major regex concept: the engine mechanics that determine match behaviour, character classes and quantifiers, grouping and capturing, backreferences, lookaheads and lookbehinds, and the flag modifiers that change how patterns evaluate. Examples are given in both JavaScript and Python — the two most common environments where developers write regex.
All patterns in this guide can be tested immediately using the browser-based regex tester linked throughout.
How Regex Engines Work
Understanding the engine saves hours of debugging. Most regex engines used in web development (JavaScript's V8, Python's re module, PCRE used in PHP and many others) are NFA-based (Non-deterministic Finite Automaton). NFA engines try alternatives and backtrack when a path fails.
The matching process: The engine positions the pattern at the start of the input and attempts to match each element of the pattern left-to-right. If an element fails, the engine backtracks — returns to the last decision point and tries an alternative. If no alternative exists, the engine moves the starting position forward by one character and tries again from the start of the pattern.
Backtracking implications: Greedy quantifiers (*, +, ?) match as much as possible first, then backtrack to find a match. Lazy quantifiers (*?, +?, ??) match as little as possible first, then expand. Possessive quantifiers and atomic groups (supported in PCRE but not JavaScript) never backtrack — they consume and commit.
Catastrophic backtracking: Patterns like (a+)+b on a string of many a characters followed by no b cause exponential time complexity. The engine tries exponentially many combinations before failing. This is the root cause of ReDoS (Regular Expression Denial of Service) attacks. Always test regex performance on adversarial inputs before using them in server-side validation.
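One rough way to see the blow-up is to time the pattern above on inputs of growing length. The sketch below uses Python's re module; the helper name time_failed_match and the input sizes are illustrative choices, and the absolute timings will vary by machine.

```python
import re
import time

nested = re.compile(r"(a+)+b")   # nested quantifiers: the risky pattern from the text
flat = re.compile(r"a+b")        # accepts the same strings without the nesting


def time_failed_match(pattern, n):
    """Time a match attempt on n 'a' characters with no trailing 'b'."""
    text = "a" * n
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


for n in (12, 15, 18):
    nested_result, nested_secs = time_failed_match(nested, n)
    flat_result, flat_secs = time_failed_match(flat, n)
    # Both attempts fail; the nested pattern typically needs far more time as n grows.
    print(n, nested_result, f"{nested_secs:.4f}s", flat_result, f"{flat_secs:.6f}s")
```

Keeping the sizes small matters here: each extra character roughly doubles the work for the nested pattern, so larger values can appear to hang.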
Anchors: ^ matches the position at the start of the string (or start of a line in multiline mode). $ matches the position at the end. \b matches a word boundary (transition between \w and \W). Anchors match positions, not characters — they consume no input.
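The anchor behaviour described above can be checked with a few calls to Python's re module. This is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat concat\ncatalog cat"

# \b keeps "cat" from matching inside "concat" or "catalog".
whole_words = re.findall(r"\bcat\b", text)

# Without re.M, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)

# With re.M, ^ also matches right after each newline.
each_line = re.findall(r"^\w+", text, flags=re.M)

# Anchors are zero-width: substituting at ^ inserts text without removing any.
quoted = re.sub(r"^", "> ", text, flags=re.M)

print(whole_words, first_only, each_line)
print(quoted)
```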
Character Classes and Shorthands
Character classes match one character from a defined set.
Literal class: [abc] matches exactly a, b, or c. [a-z] matches any lowercase ASCII letter. [0-9] matches any ASCII digit. [a-zA-Z0-9_] matches any alphanumeric character or underscore.
Negated class: [^abc] matches any character that is NOT a, b, or c. Inside a character class, ^ as the first character negates the class.
Shorthand classes:
| Shorthand | Meaning | Equivalent |
|---|---|---|
| \d | Digit | [0-9] (ASCII) |
| \D | Non-digit | [^0-9] |
| \w | Word character | [a-zA-Z0-9_] |
| \W | Non-word character | [^a-zA-Z0-9_] |
| \s | Whitespace | Space, tab, newline, carriage return |
| \S | Non-whitespace | Any non-whitespace |
| . | Any character | Any character except newline (unless s flag) |
Unicode note: In JavaScript, \d and \w match only ASCII characters, even with the u flag, while \s already matches Unicode whitespace such as the no-break space. For Unicode-aware digit matching (including Arabic-Indic digits), use \p{Decimal_Number} with the u flag. In Python's re module, \d, \w, and \s match Unicode characters by default for str patterns; pass the re.ASCII flag (re.A) to restrict them to ASCII.
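The Python side of this note can be verified directly. The sketch below mixes ASCII digits with two Arabic-Indic digits (written as escapes so the source stays ASCII); the variable names are illustrative.

```python
import re

# "12" in ASCII digits followed by Arabic-Indic digits three and four.
text = "12\u0663\u0664"

# str patterns are Unicode-aware by default, so all four characters count as digits.
unicode_digits = re.findall(r"\d", text)

# re.ASCII (re.A) restricts \d to [0-9].
ascii_digits = re.findall(r"\d", text, flags=re.A)

print(len(unicode_digits), len(ascii_digits))
```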
Escaping inside character classes: Inside [], most special characters lose their special meaning. You only need to escape ], ^ (at the start), - (between characters), and \.
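A short Python sketch of the class rules above, covering ranges, negation, and which characters need escaping inside brackets. The sample strings are arbitrary.

```python
import re

in_set = re.findall(r"[a-c]", "abcdef")
not_in_set = re.findall(r"[^a-c]", "abcdef")

# ., + and * are literal inside a class, so no escaping is needed.
literal_specials = re.findall(r"[.+*]", "a.b+c*d")

# An escaped hyphen is a literal '-', not a range from 'a' to 'z'.
escaped_hyphen = re.findall(r"[a\-z]", "a-z m")

# ] and \ must be escaped to appear as members of the class.
bracket_and_backslash = re.findall(r"[\]\\]", r"x]y\z")

print(in_set, not_in_set, literal_specials, escaped_hyphen, bracket_and_backslash)
```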
Quantifiers: How Many Matches
Quantifiers specify how many times the preceding element should match.
| Quantifier | Matches |
|---|---|
| * | 0 or more (greedy) |
| + | 1 or more (greedy) |
| ? | 0 or 1 (greedy) |
| {n} | Exactly n |
| {n,} | n or more (greedy) |
| {n,m} | Between n and m (greedy) |
Greedy vs lazy: Add ? after any quantifier to make it lazy:
- .* is greedy: it matches as many characters as possible before backtracking.
- .*? is lazy: it matches as few characters as possible.
Example: On the input <a><b><c>, the pattern <.*> (greedy) matches the entire string <a><b><c> — it extends as far right as possible. The pattern <.*?> (lazy) matches only <a> — it stops at the first >.
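The same comparison can be run in Python; this sketch uses the input string from the example above.

```python
import re

text = "<a><b><c>"

greedy = re.search(r"<.*>", text).group()   # runs to the last '>'
lazy = re.search(r"<.*?>", text).group()    # stops at the first '>'
all_lazy = re.findall(r"<.*?>", text)       # every tag, one match each

print(greedy, lazy, all_lazy)
```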
Quantifier on groups: Quantifiers apply to the preceding element — which can be a group. (ab)+ matches "ab", "abab", "ababab", etc. (ab)? matches "ab" or the empty string.
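A few quick checks of counted quantifiers and quantified groups, sketched in Python. The sample inputs are invented for illustration.

```python
import re

# {n} and {n,m} bound the repetition count.
exact = re.fullmatch(r"\d{4}", "2024") is not None
too_short = re.fullmatch(r"\d{4}", "202") is not None
ranged = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")

# A quantifier after a group repeats the whole group.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None
optional_empty = re.fullmatch(r"(ab)?", "") is not None

print(exact, too_short, ranged, repeated, optional_empty)
```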
Possessive quantifiers (PCRE and Java; also Python 3.11+): *+, ++, ?+ behave like greedy quantifiers but never backtrack, which makes them useful for preventing catastrophic backtracking. They are not available in JavaScript regex.
Groups: Capturing, Non-Capturing, and Named
Groups serve two purposes: grouping elements for quantifiers and alternation, and capturing matched substrings for use in results or replacements.
Capturing group `(...)`: Matches the enclosed pattern and captures the matched text. Groups are numbered left-to-right by their opening parenthesis. In JavaScript: match[1] for the first group. In Python: match.group(1).
Non-capturing group `(?:...)`: Groups for structure without capturing. Use when you need grouping for a quantifier but don't need the matched text. More efficient — the engine doesn't save the match. Use by default, switch to capturing only when you need the value.
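The difference between the two group types is easy to observe with Python's re.findall, which returns group contents when a pattern has capturing groups. This is a minimal sketch with made-up sample strings.

```python
import re

text = "ababab ab"

# With a capturing group, findall returns the group's text (its last repetition).
captured = re.findall(r"(ab)+", text)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"(?:ab)+", text)

# Numbered groups are read with match.group(n).
m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com")
user, host = m.group(1), m.group(2)

print(captured, whole, user, host)
```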
Named capturing group `(?<name>...)` (JavaScript) / `(?P<name>...)` (Python): Assigns a name to the group. Access by name instead of index: match.groups.name (JavaScript) or match.group('name') (Python). Named groups make patterns self-documenting and are robust to changes in group count.
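In Python, named groups might be used along these lines. The date pattern and sample sentence are illustrative only.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_re.search("Released on 2024-03-15.")

year = m.group("year")     # access by name
fields = m.groupdict()     # all named groups as a dict
same_year = m.group(1)     # named groups remain addressable by number

print(year, fields, same_year)
```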
Alternation in groups: (cat|dog) matches either "cat" or "dog". The alternation operator | has the lowest precedence — it applies to everything on either side up to the enclosing group or pattern boundary.
Nested groups: Groups can be nested. The outer group is numbered before inner groups. ((a)(b)) — group 1 captures "ab", group 2 captures "a", group 3 captures "b".
Backreferences `\1`, `\2`: A backreference matches the same text that the numbered (or named) group matched: not the same pattern, but the same literal text. (\w+) \1 matches repeated words such as "the the" and "and and". Named backreferences are written \k<name> in JavaScript and (?P=name) in Python.
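Group numbering and backreferences can be sketched in Python as follows. The sample sentences are invented, and the named form uses Python's (?P=name) spelling.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
nested = re.match(r"((a)(b))", "ab").groups()

# \1 must repeat the exact text that group 1 captured.
doubled_text = re.search(r"\b(\w+) \1\b", "it was the the best of times").group()

# The same idea with a named group.
named_word = re.search(r"\b(?P<word>\w+) (?P=word)\b", "and and so on").group("word")

# "the then" is not a repeat: the backreference needs the same text, then a word boundary.
no_repeat = re.search(r"\b(\w+) \1\b", "the then")

print(nested, doubled_text, named_word, no_repeat)
```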
Lookaheads and Lookbehinds
Lookaround assertions match based on what surrounds a position without consuming characters. They are zero-width — like anchors, they match positions, not characters.
Positive lookahead `(?=...)`: Matches a position followed by the pattern. \d+(?= dollars) matches a number followed by " dollars"; the match contains only the number, not " dollars".
Negative lookahead `(?!...)`: Matches a position NOT followed by the pattern. \b(?!un)\w+ matches words that do not start with "un".
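Positive lookbehind `(?<=...)`: Matches a position preceded by the pattern. (?<=\$)\d+ matches digits preceded by a dollar sign; the match contains only the digits. JavaScript has supported lookbehind since ES2018, so it works in current V8-based runtimes but is a syntax error in engines that predate it (the d flag is unrelated; it enables match indices). Python's re module supports lookbehind as long as the pattern inside it has a fixed width.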
Negative lookbehind `(?<!...)`: Matches a position NOT preceded by the pattern.
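The four assertion types can be tried side by side in Python. This is a sketch with invented sample strings; the negative-lookbehind pattern is an illustrative addition, not one from the text above.

```python
import re

# Positive lookahead: digits only when " dollars" follows; the suffix is not in the match.
ahead = re.findall(r"\d+(?= dollars)", "pay 30 dollars or 40 euros")

# Negative lookahead: words that do not start with "un".
not_un = re.findall(r"\b(?!un)\w+", "undo redo unhappy happy")

# Positive lookbehind: digits right after a dollar sign.
behind = re.findall(r"(?<=\$)\d+", "cost $25 and 30")

# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_behind = re.findall(r"(?<!\$)\b\d+", "cost $25 and 30")

print(ahead, not_un, behind, not_behind)
```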
Practical examples:
- Password containing a digit: ^(?=.*\d).{8,}$ (the lookahead checks for a digit anywhere without anchoring its position)
- Extract numbers from currency: (?<=[$€£])\d+(?:\.\d{2})? (matches the amount without the currency symbol)
- Match HTML tags without script tags: <(?!script)[^>]+>
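The three practical patterns above can be exercised in Python like this. The test strings are made up, and the results apply only to these inputs.

```python
import re

# At least 8 characters, with a digit anywhere.
password_re = re.compile(r"^(?=.*\d).{8,}$")
password_ok = [bool(password_re.match(p)) for p in ("password1", "password", "pass1")]

# Amounts without the leading currency symbol.
amounts = re.findall(r"(?<=[$€£])\d+(?:\.\d{2})?", "Costs $19.99, €5 or £120.50")

# Opening tags other than <script ...>.
tags = re.findall(r"<(?!script)[^>]+>", '<p><script src="x.js"><b>')

print(password_ok, amounts, tags)
```

Note that the tag pattern would still match a closing </script> tag, because the lookahead only inspects the characters immediately after the opening angle bracket.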
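Limitations: Lookahead can contain any pattern in both JavaScript and Python. JavaScript lookbehind, where it is supported (ES2018 and later), may also be variable-length; engines older than that reject lookbehind entirely. Python's re module requires the lookbehind pattern to have a fixed width in all versions. PCRE has traditionally required each alternative inside a lookbehind to have a fixed length, although different alternatives may have different lengths.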
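Python's fixed-width requirement shows up as a compile-time error. The sketch below demonstrates it and one possible workaround, splitting alternatives of different lengths into separate lookbehinds; the sample patterns are invented.

```python
import re

# Python's re rejects a lookbehind whose length can vary.
try:
    re.compile(r"(?<=a+)b")
    variable_width_allowed = True
except re.error:
    variable_width_allowed = False

# A fixed-width lookbehind compiles and matches as expected.
fixed = re.findall(r"(?<=ab)c", "abc xc")

# Workaround sketch: one fixed-width lookbehind per alternative.
split = re.findall(r"(?:(?<=a)|(?<=bc))x", "ax bcx dx")

print(variable_width_allowed, fixed, split)
```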
Multiple lookaheads: Several lookaheads at the same position are evaluated independently, and their order does not change the result. (?=.*[A-Z])(?=.*\d)(?=.*[!@#]) checks three conditions (uppercase letter, digit, special character) from the same position; all three must succeed for that position to match.
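A small Python sketch of stacked lookaheads follows. The ^ anchor is an addition here so that the three checks run once from the start of the string, and the candidate strings are invented.

```python
import re

# Each lookahead scans forward from the same position (the start of the string).
strong_re = re.compile(r"^(?=.*[A-Z])(?=.*\d)(?=.*[!@#])")

checks = {
    candidate: bool(strong_re.search(candidate))
    for candidate in ("Passw0rd!", "passw0rd!", "Password!", "Passw0rd")
}

print(checks)
```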
Flags and Modifiers
Flags modify how the entire pattern evaluates. In JavaScript, flags are appended after the closing slash (/pattern/flags); in Python, they are constants passed to re.compile().
| Flag | JS | Python | Effect |
|---|---|---|---|
| Case-insensitive | i | re.I | A matches a, B matches b |
| Multiline | m | re.M | ^/$ match start/end of each line |
| Dotall | s | re.S | . matches newlines |
| Global | g | (use findall) | Find all matches, not just first |
| Unicode | u | (default) | Enable Unicode escapes (\p{}) |
| Sticky | y | — | Match only at current position |
| Verbose | — | re.X | Allow whitespace and comments in pattern |
Multiline vs dotall: These are frequently confused. The m flag changes what ^ and $ match (line boundaries instead of string boundaries); it has no effect on the dot. The s flag changes what the dot matches (newlines are included); it has no effect on ^ and $. Use both if you need multiline anchors and cross-newline dot matching.
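The separation between the two flags can be demonstrated in Python with re.M and re.S. The two-line sample text is illustrative.

```python
import re

text = "first line\nsecond line"

# re.M changes ^ and $ only.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, flags=re.M)

# re.S changes the dot only.
dot_default = re.search(r"line.second", text)
dot_dotall = re.search(r"line.second", text, flags=re.S)

# The same pattern with re.M alone, then with both flags combined using |.
only_m = re.search(r"^first.*line$", text, flags=re.M).group()
both = re.search(r"^first.*line$", text, flags=re.M | re.S).group()

print(starts_default, starts_multiline, dot_default is None, dot_dotall is not None)
print(repr(only_m), repr(both))
```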
Global flag in JavaScript: Without g, .match() returns the first match. With g, it returns all matches (but no group captures). For all matches with group captures, use .matchAll() (returns an iterator) or a g-flag regex in a while (re.exec(str)) loop.
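Python has no g flag, but a roughly analogous split exists between re.search, re.findall, and re.finditer. The sketch below is an illustration of that analogy rather than a translation of the JavaScript API; the sample text is invented.

```python
import re

text = "x=1, y=22, z=333"
pair_re = re.compile(r"(\w)=(\d+)")

# search stops at the first match.
first = pair_re.search(text).group()

# findall returns only the captured groups when the pattern has any.
all_groups = pair_re.findall(text)

# finditer yields match objects, so the full match and its groups are both available.
all_matches = [(m.group(), m.group(1), m.group(2)) for m in pair_re.finditer(text)]

print(first, all_groups, all_matches)
```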
Python verbose mode (`re.X`): Allows whitespace and # comments inside the pattern. Useful for complex patterns that need documentation:
import re

pattern = re.compile(r"""
    (?P<year>\d{4})   # 4-digit year
    -
    (?P<month>\d{2})  # 2-digit month
    -
    (?P<day>\d{2})    # 2-digit day
""", re.X)

Common Regex Patterns
Email validation (simplified): ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Note: Email is complex — this pattern rejects some valid addresses (quoted strings, IP literals) and accepts some technically invalid ones. For production use, prefer a dedicated email validation library over regex.
URL: https?://[\w.-]+(?:\.[\w.-]+)+[\w\-._~:/?#\[\]@!$&'()*+,;=]*
This is a simplified pattern — the full URL standard (RFC 3986) is too complex for a single regex.
ISO 8601 date: \d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])
Validates format only — does not check for valid days-in-month or leap years.
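One way to close that gap is to let the regex check the shape and hand the calendar check to the standard library. The sketch below assumes Python's datetime module; the helper name is_real_date and the extra capture group around the year are additions for illustration.

```python
import re
from datetime import date

# Same pattern as above, with the year wrapped in a group so all three parts are captured.
iso_date_re = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def is_real_date(text):
    """Return True only if text has the YYYY-MM-DD shape and names a real calendar day."""
    m = iso_date_re.fullmatch(text)
    if m is None:
        return False
    year, month, day = (int(part) for part in m.groups())
    try:
        date(year, month, day)  # raises ValueError for dates such as 30 February
    except ValueError:
        return False
    return True


for candidate in ("2024-02-29", "2023-02-29", "2023-02-30", "2023-13-01"):
    print(candidate, is_real_date(candidate))
```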
UUID v4: [0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}
IPv4 address: ^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
HTML tag (simplified): <([a-z][a-z0-9]*)(?:\s[^>]*)?>.*?</\1>
Warning: parsing HTML with regex is fundamentally limited — use a proper HTML parser for production.
Semantic version: (0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?
This is the numbered-group pattern suggested on semver.org, shown here without its ^ and $ anchors; add them to validate a whole string.
Credit card (Luhn check NOT included): ^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12})$
Note: Format only. Always run the Luhn algorithm in addition to the pattern match.
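A minimal sketch of combining the format check with a Luhn checksum in Python follows. The function names luhn_valid and looks_like_card are illustrative, and the sample number is a widely used test value, not a real card.

```python
import re

card_re = re.compile(
    r"^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}"
    r"|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12})$"
)


def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card(number):
    """Format check first, then the checksum."""
    return bool(card_re.fullmatch(number)) and luhn_valid(number)


print(looks_like_card("4111111111111111"), looks_like_card("4111111111111112"))
```

Running the regex first means luhn_valid only ever sees strings made of digits, so the int() conversion cannot fail.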
Frequently Asked Questions
Why does my regex work in the tester but not in my code?
What is the difference between greedy and lazy quantifiers?
How do I match a literal dot (.) in regex?
Can regex match balanced brackets or nested structures?
What causes catastrophic backtracking and how do I avoid it?
Summary
Regular expressions reward investment. Once you understand the engine's NFA backtracking model, character class syntax, quantifier greediness, and the zero-width nature of lookaround assertions, the patterns stop looking like noise and start reading like precise specifications. Start with non-capturing groups by default, add named captures when you need the values, and always test on adversarial inputs before deploying server-side validation.