Python regex cheatsheet: 20 essential patterns explained

This Python regex cheatsheet collects the 20 patterns I actually reach for in 2026, straight from the re module. Parsing logs at 2 a.m. when something is on fire, validating forms nobody appreciates until they break, the unglamorous work I keep coming back to. Each pattern gets the raw string, what it really does, an example I have run, the edge case that is going to bite you, and the library I would grab when regex stops being worth the fight. I test these on Python 3.12 and up. They follow habits I learned the slow, painful way: raw strings every time, anchors when validating, non-capturing groups when I am only grouping.

The short answer

A Python regex cheatsheet of 20 tested re patterns for email, IPv4 and IPv6, URL, phone, ISO 8601, UUID, CSV split, log lines, slugs and more. Each one ships with the raw string, a matching example, and the edge case that will trip you. When validating, anchor with ^ and $; drop the anchors for re.search or re.findall.

20 patterns that cover most jobs

re: standard library, no install

3.12+: tested on this Python

Figure: answer card listing a Python regex cheatsheet of 20 tested re patterns, each with its edge case. Twenty patterns that cover most of the day-to-day. Every card carries the regex, an example, and the trap.

How to read the cards

Every card has the same shape. The raw re pattern, what I am using it for, an input that matches, the edge case I left out on purpose (and why it will trip you), then where I would bail to once regex stops earning its keep. One thing to watch. When I am validating, I anchor everything with ^ and $, so the whole string has to match end to end. Want to pull these out of a bigger blob of text with re.search or re.findall? Drop the anchors first. Otherwise you will sit there wondering why nothing matches, which, yeah, I have done.
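Here is a small sketch of that anchoring rule, using the 24-hour time pattern from card 8. The names TIME_ANCHORED and TIME_BODY are mine, picked for illustration. It also shows a detail worth knowing: $ still matches just before a trailing newline, while re.fullmatch does not.

```python
import re

# Anchored form: for validating one whole string.
TIME_ANCHORED = re.compile(r"^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$")
# Same body without anchors: for search/findall inside longer text.
TIME_BODY = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d")

text = "backup ran 02:15:00, finished 02:47:31"

anchored_hits = TIME_ANCHORED.findall(text)    # anchors demand a whole-string match
unanchored_hits = TIME_BODY.findall(text)      # finds both timestamps

# "$" tolerates one trailing newline; fullmatch is strict about it.
dollar_accepts_newline = bool(TIME_ANCHORED.match("14:30:00\n"))
fullmatch_accepts_newline = bool(TIME_BODY.fullmatch("14:30:00\n"))
```

If a stray newline matters for your input, re.fullmatch on the unanchored body is the stricter choice.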

The 20 patterns

1. Email validation

r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"

What it does: Catches the everyday address. Letters, digits, dots either side of the at sign, plus and dash allowed too, and a domain carrying at least one dot. This is the one I actually ship.

Example: alice.smith+work@example.co.uk matches.

Edge case: It won't touch RFC 5321 quoted local parts or the weirder internationalised domains. And honestly? That's fine. Chase a full RFC parser inside a regex and you'll lose, I promise. This one covers what your app needs.

Alternative: email.utils.parseaddr if you just want canonical parsing. The email-validator PyPI package when you genuinely need RFC compliance.

2. IPv4 address

r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)$"

What it does: Checks each octet really sits in 0-255, so junk like 999.0.0.1 or 1.2.3.4.5 gets bounced. That octet range is the whole reason this thing looks so ugly.

Example: 192.168.1.42 matches; 256.0.0.1 does not.

Edge case: It happily swallows leading zeros like 010.0.0.1. Quiet little trap, that one. Some parsers read 010 as octal and suddenly you're pointed at a host you never meant to hit. Strip the zeros first if that's in scope.

Alternative: For pure validation I'd just hand the string to ipaddress.IPv4Address(s) and catch the ValueError. Cleaner, and it can't lie to you the way the regex can.
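A minimal sketch of that stdlib route. The helper name is_ipv4 is my own, not something from the ipaddress module.

```python
import ipaddress

def is_ipv4(s: str) -> bool:
    """Return True if s parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(s)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True
```

How it treats leading zeros depends on the Python version, so test that case on the interpreter you deploy on rather than assuming.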

3. IPv6 address (simplified)

r"^(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}$"

What it does: Matches a fully written-out IPv6 address. All 8 hex groups, colons between them, nothing folded down.

Example: 2001:0db8:85a3:0000:0000:8a2e:0370:7334 matches.

Edge case: Here's the catch, and it's a big one. Real IPv6 in the wild almost never gets written out in full. The compressed :: form slips past it. So does IPv4-mapped ::ffff:1.2.3.4 and the %eth0 zone suffix. I'd reach for this only when I already know the input is expanded.

Alternative: ipaddress.IPv6Address(s) just gets every legal form right. For IPv6 I don't even bother with regex anymore.
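To make the gap concrete, this sketch runs the card's regex and ipaddress side by side on three spellings. FULL_IPV6_RE, is_ipv6 and the sample list are illustrative names of mine.

```python
import ipaddress
import re

FULL_IPV6_RE = re.compile(r"^(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}$")

def is_ipv6(s: str) -> bool:
    """Return True if the stdlib accepts s as an IPv6 address."""
    try:
        ipaddress.IPv6Address(s)
    except ValueError:
        return False
    return True

samples = [
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",  # fully expanded
    "2001:db8::1",                              # compressed with ::
    "::ffff:1.2.3.4",                           # IPv4-mapped
]

# Maps each sample to (regex verdict, ipaddress verdict).
results = {s: (bool(FULL_IPV6_RE.match(s)), is_ipv6(s)) for s in samples}
```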

4. URL (http / https)

r"^https?://[\w.-]+(?:\.[a-zA-Z]{2,})+(?:[/?#][^\s]*)?$"

What it does: Grabs http and https URLs that carry a real TLD, with the optional path, query and fragment hanging off the tail.

Example: https://example.com/path?id=42#section matches.

Edge case: It only knows http and https. Throw ftp or mailto or a data: URI at it and you get nothing back. It doesn't understand user:pass@ userinfo or IDN domains either. Fine for a link checker. Wrong for anything that has to swallow arbitrary URLs.

Alternative: urllib.parse.urlparse(s) almost never raises, it just parses (a malformed bracketed IPv6 host is the rare ValueError). So check the scheme and netloc it hands back and decide for yourself.
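A sketch of what "check the scheme and netloc" can look like. The helper is_http_url is a hypothetical name, and the rule it applies (http or https plus a non-empty host part) is my assumption about what a link checker wants.

```python
from urllib.parse import urlparse

def is_http_url(s: str) -> bool:
    """True if s parses with an http(s) scheme and a non-empty netloc."""
    try:
        parts = urlparse(s)
    except ValueError:  # raised for a few malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

This is looser than the regex in one way: it does not insist on a dotted TLD, so http://localhost passes.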

5. Phone number in E.164

r"^\+[1-9]\d{6,14}$"

What it does: Checks the E.164 shape. A plus sign, a first digit that can't be zero (country codes never start with 0), then 6 to 14 more digits, so 7 to 15 digits in total. That's the format you want sitting in a database.

Example: +33612345678 matches.

Edge case: It checks the shape, not reality. +19999999999 sails right through, and no human could ever dial that. If "looks like a phone number" is enough for you, ship it. Just don't go telling your users it's verified, because it isn't.

Alternative: When I actually need to trust the number, I reach for Google's phonenumbers package. It knows the per-country length and prefix rules the regex never could.

6. ISO 8601 datetime

r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$"

What it does: Matches the canonical ISO 8601 timestamp, fractional seconds and timezone offset both optional. The format APIs hand you all day long.

Example: 2026-05-27T14:30:00.123+02:00 matches.

Edge case: The regex counts digits, not calendars. So 2026-02-30 slides right through. February has opinions about that. The pattern does not share them. If a bad date can reach you, validate after the match, not instead of it.

Alternative: On 3.11+ I just call datetime.fromisoformat(s). It accepts most ISO 8601 formats and rejects the dates that can't exist. That's what I'd do.
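"Validate after the match, not instead of it" might look like this sketch. parse_iso is a hypothetical helper; it uses the regex to pin down the shape and datetime.fromisoformat to check the calendar.

```python
import re
from datetime import datetime

ISO_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$"
)

def parse_iso(s: str):
    """Return a datetime for a well-shaped, real timestamp, else None."""
    if not ISO_RE.match(s):
        return None           # wrong shape
    try:
        return datetime.fromisoformat(s)
    except ValueError:
        return None           # right shape, impossible date or time

good = parse_iso("2026-05-27T14:30:00.123+02:00")
bad = parse_iso("2026-02-30T10:00:00")   # passes the regex, fails the calendar
```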

7. Date DD/MM/YYYY (European)

r"^(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}$"

What it does: Validates European DD/MM/YYYY. Keeps the day and month inside sane ranges instead of letting any two random digits through.

Example: 27/05/2026 matches; 32/05/2026 does not.

Edge case: Same trap as the ISO one. 31/02/2026 matches, because the pattern has no clue February stops at 28 or 29. If the calendar matters to you, finish the job with datetime.strptime(s, "%d/%m/%Y").

Alternative: datetime.strptime throws ValueError on a date that can't exist. Which is exactly what you want for strict validation.

8. Time HH:MM:SS (24-hour)

r"^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"

What it does: Validates a 24-hour clock. Hours 00-23, minutes and seconds capped at 00-59. No 25:00 nonsense gets past it.

Example: 14:30:00 matches; 25:00:00 does not.

Edge case: It rejects leap seconds (23:59:60), which a handful of timestamp standards actually allow. You'll probably never hit one. But if you're parsing data from a system that emits them, you'll be quietly dropping valid rows and never know why.

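Alternative: datetime.time.fromisoformat when you want a real time object with range-checked fields, not just a regex counting digits. It also accepts shorter forms like 14:30, so keep the regex if the format itself has to be exact.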

9. UUID v4 (random)

r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"

What it does: Checks the v4 layout specifically. The version nibble has to be 4, and the variant nibble has to land on 8, 9, a or b. So it won't wave through a random v1 UUID dressed up to look the part.

Example: 550e8400-e29b-41d4-a716-446655440000 matches.

Edge case: Lowercase only. Hand it an uppercase UUID and it just shrugs. This one bites me more than I'd like to admit. Add re.IGNORECASE or switch to [0-9a-fA-F] and move on with your day.

Alternative: uuid.UUID(s) doesn't care about version or case. It just tells you whether it's a UUID, full stop.
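If you want the stdlib parser but still care about v4, a sketch like this works. is_uuid4 is my own helper name. Be aware that uuid.UUID is more forgiving about spelling than the regex: it also accepts uppercase, braces and hyphen-free input.

```python
import uuid

def is_uuid4(s: str) -> bool:
    """True if s parses as a UUID whose version field is 4."""
    try:
        parsed = uuid.UUID(s)
    except ValueError:
        return False
    return parsed.version == 4
```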

10. Hex colour code

r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$"

What it does: Matches CSS hex colours in the lengths you actually paste. 3-digit shorthand (#f00), 6-digit (#ff0000), plus 8-digit RGBA (#ff0000ff).

Example: #1e293b matches.

Edge case: It skips the 4-digit shorthand RGBA (#f00f), which, honestly, I forget exists about half the time. If your design tokens lean on it, drop a {4} into the alternation and you're covered.

Alternative: Modern CSS also lets you write named colours and function syntax like rgb() and oklch(). Regex won't keep up with all that. If you need it covered, parse the CSS properly.

Figure: Python REPL showing import re and re.findall extracting digit groups from a string. Drop the anchors and re.findall pulls every match out of a larger blob. The REPL is the fastest way to sanity-check a pattern.

11. CSV row split with quoted fields

r'(?:^|,)("(?:[^"]|"")*"|[^,]*)'

What it does: Walks the fields of a comma-separated line and copes with the annoying part. Double-quoted fields that hide commas inside them, and the doubled "" escape on top.

Example: alice,"smith, jr.",42 yields three captures.

Edge case: Two gaps, and you'll feel them fast. It can't deal with a newline inside a quoted field, the kind where one CSV record spills across several lines. It also leaves the surrounding quotes stuck on the captures for you to strip yourself. Real CSV does both of those constantly.

Alternative: Look, just use csv.reader. It's in the stdlib, it handles every one of these edge cases, and I only touch the regex when pulling in csv would feel like silly overkill.
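Both gaps show up in a few lines. This sketch runs the card's regex and csv.reader on the same row, then feeds csv.reader a record with a newline inside a quoted field. The sample data is made up.

```python
import csv
import io
import re

FIELD_RE = re.compile(r'(?:^|,)("(?:[^"]|"")*"|[^,]*)')

line = 'alice,"smith, jr.",42'
regex_fields = FIELD_RE.findall(line)    # surrounding quotes stay on field 2
csv_fields = next(csv.reader([line]))    # quotes removed, inner comma kept

# One record whose quoted field spans two physical lines.
multiline = 'id,note\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(multiline, newline="")))
```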

12. Apache or nginx common log entry

r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\S+)'

What it does: Pulls the seven fields out of a common-log line: client IP, timestamp, method, path, protocol, status, bytes sent. This is the one I've pasted into more throwaway scripts than anything else on this list.

Example: 127.0.0.1 - - [27/May/2026:14:30:00 +0200] "GET /index.html HTTP/1.1" 200 1234 matches.

Edge case: It stops at the common format. So the combined-format extras, referer and user-agent, just fall off the end. Most servers I touch log combined anyway, so I tack on "([^"]*)" "([^"]*)" and grab those two as well.

Alternative: Fine for a one-off grep. But once you're parsing logs at any real volume, ship them to something like Loki with a proper parser pipeline. A hand-rolled regex per server doesn't scale, and you'll resent maintaining it within a month.
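For the one-off case, here is a sketch of the same pattern with named groups and the two combined-format fields made optional. The group names (ip, time, referer and so on) are my additions, not part of any log standard.

```python
import re

LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d+) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'   # combined-format extras
)

common = '127.0.0.1 - - [27/May/2026:14:30:00 +0200] "GET /index.html HTTP/1.1" 200 1234'
combined = common + ' "https://example.com/" "curl/8.0"'

common_hit = LOG_RE.match(common).groupdict()
combined_hit = LOG_RE.match(combined).groupdict()
```

With a common-format line, referer and agent simply come back as None, so one pattern covers both layouts.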

13. Whitespace collapse

re.sub(r"\s+", " ", text).strip()

What it does: Squashes every run of whitespace, tabs and newlines included, down to one space, then trims the ends. My go-to one-liner for cleaning up OCR output or scraped HTML. Also whatever a user just pasted into a box.

Example: " Hello\t\n world " becomes "Hello world".

Edge case: Run it on raw HTML and it'll cheerfully flatten the whitespace inside your <pre> blocks too, which mangles your code samples. Point it at text content only. Never the markup.

Alternative: " ".join(text.split()) does the exact same thing with no regex at all, and it's a touch faster on short strings. Half the time that's what I actually reach for.

14. Markdown link extraction

r"\[([^\]]+)\]\(([^)]+)\)"

What it does: Splits a Markdown link into its two useful halves. The visible label in group 1, the URL in group 2, out of something like [label](https://example.com).

Example: See [the docs](https://docs.example.com) yields label "the docs" and URL "https://docs.example.com".

Edge case: It breaks the second a bracket shows up in the label or a paren shows up in the URL. And Wikipedia URLs are absolutely riddled with parentheses, so this fails way more often than you'd guess. It can't count nested brackets. No regex really can, that's not a flaw you can patch out.

Alternative: For Markdown you didn't write yourself, hand the job to a real parser like mistune or markdown-it-py and stop guessing.
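The parenthesis problem is easy to see for yourself. In this sketch the second link still matches, but the URL comes back cut off at the first closing paren, which is arguably worse than no match at all.

```python
import re

LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

simple = LINK_RE.findall("See [the docs](https://docs.example.com)")

wiki = LINK_RE.findall(
    "[Python](https://en.wikipedia.org/wiki/Python_(programming_language))"
)
# The captured URL stops at the first ")", losing the final character.
truncated_url = wiki[0][1]
```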

15. JSON number

r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"

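What it does: Matches a number the way the JSON spec actually defines it. Optional minus sign (no leading plus), no sneaky leading zero, then an optional decimal and exponent. It mirrors the grammar, which is the entire point.

Example: With re.fullmatch, -1.23e-4 matches and 007 does not. The pattern carries no anchors, so re.search would still find the leading 0 inside 007.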

Edge case: It won't match Infinity or NaN or a hex literal. That's deliberate, because none of those are legal JSON to begin with. If your "JSON" is carrying them around, something upstream is already lying to you.

Alternative: When it's a whole document instead of a stray token, skip the regex and let json.loads(s) do the parsing.
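Because this pattern ships without anchors, how you call it changes the answer. A short sketch, with JSON_NUMBER_RE as an illustrative name:

```python
import re

JSON_NUMBER_RE = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")

def is_json_number(token: str) -> bool:
    """True only if the whole token is a JSON number."""
    return JSON_NUMBER_RE.fullmatch(token) is not None

# search() is happy with a partial hit: it finds the first "0" in "007".
partial = JSON_NUMBER_RE.search("007").group()
```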

16. Python identifier (ASCII-only)

r"^[A-Za-z_][A-Za-z0-9_]*$"

What it does: Checks the classic ASCII-only variable name. Starts with a letter or underscore, then letters or digits or underscores the rest of the way down.

Example: my_var2 matches.

Edge case: Modern Python is perfectly happy with Unicode names like élève or π, and this pattern slams the door on every single one. So if you're accepting Unicode, this regex will reject code that runs absolutely fine. Annoying.

Alternative: Skip the regex entirely and call "name".isidentifier(). It's the official check, and it already knows the Unicode rules cold.
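One wrinkle with isidentifier(): it says yes to reserved words such as class. If the name has to be usable as a variable, a sketch like this pairs it with the keyword module. is_usable_name is a hypothetical helper.

```python
import keyword
import re

ASCII_ID_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_usable_name(s: str) -> bool:
    """True if s is a valid identifier and not a reserved keyword."""
    return s.isidentifier() and not keyword.iskeyword(s)

# The ASCII-only regex rejects a name that Python itself accepts.
regex_rejects_unicode = ASCII_ID_RE.match("élève") is None
```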

17. Unix file path

r"^/(?:[^/\x00]+/)*[^/\x00]*$"

What it does: Matches an absolute Unix path. Forward slashes, and no null bytes hiding in the segments.

Example: /home/user/file.txt matches.

Edge case: Read this one twice if you care about security. It happily accepts .. and . segments, which is exactly how a path-traversal attack climbs out of the directory you meant to confine it to. "It matched my path regex" is not a security check, full stop. Run the thing through pathlib.Path.resolve() and confirm where it really lands before you trust a byte of it.

Alternative: pathlib.PurePosixPath(s) parses the string without ever touching the filesystem.
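The traversal point deserves a demo. This is a sketch of a containment check, not a complete defence: stays_inside and the /srv/uploads directory are assumptions for illustration, and it relies on Path.resolve() normalising .. segments even when the path does not exist.

```python
import re
from pathlib import Path

UNIX_PATH_RE = re.compile(r"^/(?:[^/\x00]+/)*[^/\x00]*$")

def stays_inside(base: str, user_path: str) -> bool:
    """True if user_path, joined onto base, still resolves under base."""
    base_dir = Path(base).resolve()
    target = (base_dir / user_path).resolve()
    return target == base_dir or base_dir in target.parents

sneaky = "/srv/uploads/../../etc/passwd"
regex_says_ok = bool(UNIX_PATH_RE.match(sneaky))   # the shape is fine, the intent is not
```

The regex approves the sneaky path because .. is just another segment to it; only the resolved comparison notices where the path ends up.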

18. Windows file path

r'^[A-Za-z]:\\(?:[^\\/:*?"<>|]+\\)*[^\\/:*?"<>|]*$'

What it does: Matches an absolute Windows path. Drive letter, backslash separators, and it throws out the characters Windows forbids in a filename.

Example: C:\Users\admin\file.txt matches.

Edge case: Windows paths are a swamp. This pattern misses UNC shares (\\server\share) and the long-path prefix (\\?\). It also ignores the fact that Windows quietly accepts forward slashes. I've been burned by all three, separately, on bad days. Cover only the cases you're sure you'll actually see.

Alternative: pathlib.PureWindowsPath(s) parses without going anywhere near the disk, and it knows the quirks this regex doesn't.

19. Strong password

r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{12,}$"

What it does: Enforces the usual composition rules with four lookaheads. At least 12 characters, and one each of lowercase, uppercase, a digit, plus a symbol.

Example: P@ssw0rd-StrongEnough matches.

Edge case: Here's the uncomfortable truth, and I might be in the minority on how loudly to say it. Ticking these boxes has almost nothing to do with whether a password is actually strong. P@ssw0rd1234 passes every rule here and sits in basically every breach dump on the internet. Composition rules measure how obedient a password is. Entropy is a separate question. Always pair this with a check against a known leaked-password list.

Alternative: zxcvbn-python actually estimates how guessable a password is, and it spots the dictionary words and keyboard walks a regex will never see.
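A sketch of the pairing described above. KNOWN_LEAKED is a one-entry stand-in I made up; a real deployment would check against an actual breached-password corpus, not a hard-coded set.

```python
import re

STRONG_RE = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{12,}$"
)

# Hypothetical stand-in for a real leaked-password list.
KNOWN_LEAKED = {"P@ssw0rd1234"}

def acceptable(password: str) -> bool:
    """Composition rules first, then the leaked-list check."""
    return bool(STRONG_RE.match(password)) and password not in KNOWN_LEAKED

passes_regex_alone = bool(STRONG_RE.match("P@ssw0rd1234"))
```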

20. URL-friendly slug

r"^[a-z0-9]+(?:-[a-z0-9]+)*$"

What it does: Validates a clean lowercase slug. Alphanumeric groups joined by single hyphens, with no hyphen sitting at the start or the end, and none doubled up through the middle.

Example: my-awesome-post-2026 matches; --bad-- does not.

Edge case: It validates a slug. It doesn't make one. Feed it "Café Brûlé" and it just says no, because the capitals, the space and the accents all stop it cold. You've got to decompose with unicodedata.normalize("NFKD", ...) and drop the combining marks, then lowercase and hyphenate, and only then validate the result.

Alternative: For actually generating slugs I don't reinvent any of this. python-slugify handles the normalising and transliteration and length-capping in one single call.
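If pulling in a dependency is not an option, a rough stdlib-only sketch looks like this. slugify here is my own minimal helper, not python-slugify's API, and it does no transliteration: text with no Latin base letters comes out empty and then fails validation.

```python
import re
import unicodedata

SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def slugify(text: str) -> str:
    """Strip accents, lowercase, and join alphanumeric runs with hyphens."""
    decomposed = unicodedata.normalize("NFKD", text)            # é -> e + combining mark
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

slug = slugify("Café Brûlé")
```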

Patterns to avoid in 2026

A few "classic" regexes keep getting copy-pasted out of decade-old StackOverflow answers, and every one has burned me at least once. Take the "perfect email" regex, that 1,000-character monster trying to be RFC 5322 on a single line. It rejects real addresses that deliver mail just fine. Use a short pragmatic pattern, then send a confirmation email and be done. The inbox is the only validator that actually counts. Then there's the "URL that matches everything", with every scheme and userinfo and IDN bolted on. It's brittle, and urllib.parse.urlparse beats it on both speed and correctness anyway. And the "credit-card validator" that sniffs out the card brands by their number prefixes? That one's worse than useless. It hands you a warm feeling of security you didn't earn. The Luhn check tells you a number is well-formed. Only the bank tells you it's real, and a bare regex can't even manage the first part.
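For reference, the Luhn check mentioned above is a few lines of arithmetic, not a regex job. A sketch, with luhn_ok as an illustrative name; it tells you a number is well-formed and nothing more.

```python
def luhn_ok(number: str) -> bool:
    """True if the digits in number pass the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```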

Performance notes for high-volume regex

Compile once, reuse forever. That's the one habit that pays off every single time. Put re.compile(pattern) up at module scope and call PATTERN.match(s) inside your hot loop. Yes, re caches compiled patterns internally already. But leaning on that cache in a tight loop still measurably loses to a plain module-level constant, and the intent reads clearer anyway. The other thing that hangs a process is catastrophic backtracking. Possessive quantifiers and atomic groups shut it down before it starts, and since Python 3.11 the stdlib re has both; on older interpreters the regex PyPI package provides them. And if you've got a fixed set of patterns and don't need named captures, the hyperscan bindings can hand you something like 100x on the right workload. I save that one for when the profiler actually points there. Not a moment before, because it's a real dependency to carry.
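The compile-once habit, as a sketch. SLUG_RE and count_valid_slugs are names I picked for the example; the pattern is the slug one from card 20.

```python
import re

# Compiled once, at import time.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def count_valid_slugs(candidates) -> int:
    """Count candidates that are well-formed slugs."""
    match = SLUG_RE.match          # bind the method once, outside the loop
    return sum(1 for c in candidates if match(c))

total = count_valid_slugs(["my-post", "--bad--", "ok", "Not-Ok"])
```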

Frequently asked questions

Why use raw strings (r"...") for every pattern?

Because backslashes mean something to both Python and the regex engine, and that double meaning is exactly where the bugs hide. Without the r prefix, a sequence like \n turns into a real newline before the regex even gets a look. The raw prefix makes your pattern read just like the docs you lifted it from, so I put r on every pattern with no exceptions.
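A quick sketch of the double meaning. The sharpest example I know is \b: in a normal string it is a backspace character, so the word-boundary pattern silently matches nothing.

```python
import re

text = "a cat sat"

without_raw = re.findall("\bcat\b", text)    # "\b" is a backspace here
with_raw = re.findall(r"\bcat\b", text)      # r"\b" reaches the engine as a word boundary

newline_len = len("\n")     # one character: a real newline
raw_len = len(r"\n")        # two characters: backslash and n
```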

When is regex the wrong tool?

Anytime the thing nests. HTML, JSON, anything with a recursive grammar cannot be parsed correctly by regex, and that is the actual math, not me being purist. Dates, paths and UUIDs already have a library each (datetime, pathlib, uuid) that is stronger and reads better. Where regex earns its keep is the flat, well-defined stuff.

How do I avoid catastrophic backtracking?

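Look for the real culprit first: a quantifier nested inside another quantifier, or alternatives that can match the same text, as in (a+)+. Rewrite those so there is only one way to match. Non-capturing groups (?:...) are tidier when you do not need the capture, but on their own they do not stop backtracking. Possessive quantifiers and atomic groups do, and they are in the stdlib re from Python 3.11 (earlier versions need the regex module). Then attack your own pattern on purpose: feed it the letter a repeated thirty-odd times, followed by one character that forces the match to fail, and watch. If it hangs on that toy input, it will hang on a hostile request in production.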
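Here is a scaled-down version of that self-attack. It is a sketch under assumptions: RISKY and SAFE are illustrative names, and the input is kept to 18 letters on purpose, because with a nested quantifier each extra letter tends to roughly double the work.

```python
import re
import time

RISKY = re.compile(r"^(?:a+)+$")   # quantifier nested inside a quantifier
SAFE = re.compile(r"^a+$")         # accepts the same strings, one way to match

def timed_match(pattern, text):
    """Return (matched, seconds_taken) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

hostile = "a" * 18 + "!"           # the "!" forces the match to fail

risky_matched, risky_seconds = timed_match(RISKY, hostile)
safe_matched, safe_seconds = timed_match(SAFE, hostile)
```

Print the two timings and grow the input a letter at a time; stop well before thirty if the risky number starts climbing.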

Does Python regex support look-behind?

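Yes, with one limit. Fixed-width look-behind, (?<=...) and (?<!...), has been around forever, but the stdlib re still rejects variable-width look-behind with a "look-behind requires fixed-width pattern" error; the regex module on PyPI is where that is supported. Look-aheads (?=...) have always worked too, at any width. One piece of advice: go easy on them. Stacking three look-arounds into one pattern is usually the day you should have split it into two plain passes.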
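You can confirm the fixed-width rule in a few lines. A sketch using a made-up price string:

```python
import re

# Fixed-width look-behind: digits that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost: $42, tax: $7")

# Variable-width look-behind: the stdlib re refuses to compile it.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```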

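What about Unicode classes like \p{L}?

The standard library re module does not support \p{...} property classes at all, so trying one just raises an error. The regex module on PyPI does, and it is a drop-in replacement. If I am only validating ASCII, plain re is fine; the moment real Unicode is on the table, I install regex.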
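On the \p{L} question, this sketch shows what plain re does with it, plus a stdlib-only workaround that I am assuming is close enough for many jobs: [^\W\d_] means "a word character that is not a digit or underscore". It approximates \p{L} but is not identical, since \w also lets a few numeric characters through.

```python
import re

# The stdlib engine rejects \p{...} at compile time.
try:
    re.compile(r"\p{L}+")
    supports_property_classes = True
except re.error:
    supports_property_classes = False

# Approximate "letters only" using what re does support.
LETTERS_RE = re.compile(r"[^\W\d_]+")
words = LETTERS_RE.findall("élève π x2 naïve_café")
```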