I’ve leaned on Python’s re module for years. Parsing logs at 2 a.m. when something’s on fire and nobody else is awake. Validating forms that, let’s be real, nobody appreciates until they break. It’s not glamorous work. Maybe that’s why I keep coming back to it. Anyway, here are the 20 patterns I actually use in 2026. Each one gets the raw pattern, what it really does, an example I’ve run, the edge case that’s going to bite you (and one always does), plus the library I’d grab when regex stops being worth the fight. I test these on Python 3.12+. They follow habits I learned the slow, painful way: raw strings every time, anchors when I’m validating, and non-capturing groups when I’m only grouping and not capturing.
The 20 patterns
- Email validation
- IPv4 address
- IPv6 address (simplified)
- URL (http/https)
- Phone in E.164
- ISO 8601 datetime
- Date DD/MM/YYYY
- Time HH:MM:SS (24h)
- UUID v4
- Hex colour code
- CSV row split (quoted)
- Apache / nginx log entry
- Whitespace collapse
- Markdown link
- JSON number
- Python identifier
- Unix file path
- Windows file path
- Strong password
- URL-friendly slug
How to read the cards
Every card has the same shape. The raw re pattern, what I’m using it for, an input that matches, the edge case I left out on purpose (and why it’ll trip you), then where I’d bail to once regex stops earning its keep. One thing to watch. When I’m validating, I anchor everything with ^ and $, so the whole string has to match end to end. Small print: $ also matches just before a single trailing newline, so when even that has to be rejected, use re.fullmatch or swap $ for \Z. Want to pull these out of a bigger blob of text with re.search or re.findall? Drop the anchors first. Otherwise you’ll sit there wondering why nothing matches, which, yeah, I’ve done.
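A minimal sketch of the anchoring point. The HH:MM pattern and the names ANCHORED and BARE are made up for the illustration, they aren’t one of the cards.

```python
import re

# Hypothetical two-digit HH:MM pattern, once anchored and once bare.
ANCHORED = re.compile(r"^\d{2}:\d{2}$")
BARE = re.compile(r"\d{2}:\d{2}")

text = "meeting at 14:30 sharp"

anchored_hit = ANCHORED.search(text)   # None: the anchors demand the whole string
bare_hits = BARE.findall(text)         # ['14:30']

# `$` also matches just before one trailing newline, so fullmatch()
# is the stricter choice when the input must be exactly the pattern.
lenient = ANCHORED.match("14:30\n") is not None   # True
strict = BARE.fullmatch("14:30\n") is not None    # False
```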
Email validation
r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"
What it does: Catches the everyday address. Letters, digits, dots either side of the at sign, plus and dash allowed too, and a domain carrying at least one dot. This is the one I actually ship.
Example: alice.smith+work@example.co.uk matches.
Edge case: It won’t touch RFC 5321 quoted local parts or the weirder internationalised domains. And honestly? That’s fine. Chase a full RFC parser inside a regex and you’ll lose, I promise. This one covers what your app needs.
Alternative: email.utils.parseaddr if you just want canonical parsing. The email-validator PyPI package when you genuinely need RFC compliance.
IPv4 address
r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)$"
What it does: Checks each octet really sits in 0-255, so junk like 999.0.0.1 or 1.2.3.4.5 gets bounced. That octet range is the whole reason this thing looks so ugly.
Example: 192.168.1.42 matches; 256.0.0.1 does not.
Edge case: It happily swallows leading zeros like 010.0.0.1. Quiet little trap, that one. Some parsers read 010 as octal and suddenly you’re pointed at a host you never meant to hit. Strip the zeros first if that’s in scope.
Alternative: For pure validation I’d just hand the string to ipaddress.IPv4Address(s) and catch the ValueError. Cleaner, and it can’t lie to you the way the regex can.
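Here’s roughly how the two approaches sit side by side. The helper name is_ipv4 is mine, not something from the stdlib.

```python
import ipaddress
import re

IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)$"
)

def is_ipv4(s: str) -> bool:
    """Hypothetical helper: True when the stdlib accepts s as an IPv4 address."""
    try:
        ipaddress.IPv4Address(s)
    except ValueError:  # AddressValueError is a ValueError subclass
        return False
    return True

regex_ok = bool(IPV4.match("192.168.1.42"))          # True
regex_rejects = IPV4.match("256.0.0.1") is None      # True
regex_leading_zero = bool(IPV4.match("010.0.0.1"))   # True: the trap from the card
```

How ipaddress treats the leading-zero form depends on the Python version, so check it on the interpreter you actually deploy.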
IPv6 address (simplified)
r"^(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}$"
What it does: Matches a fully written-out IPv6 address. All 8 hex groups, colons between them, nothing folded down.
Example: 2001:0db8:85a3:0000:0000:8a2e:0370:7334 matches.
Edge case: Here’s the catch, and it’s a big one. Real IPv6 in the wild almost never gets written out in full. The compressed :: form slips past it. So does IPv4-mapped ::ffff:1.2.3.4 and the %eth0 zone suffix. I’d reach for this only when I already know the input is expanded.
Alternative: ipaddress.IPv6Address(s) just gets every legal form right. For IPv6 I don’t even bother with regex anymore.
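A quick sketch of the gap between the expanded-only pattern and the stdlib. The sample addresses and the is_ipv6 helper are illustrative.

```python
import ipaddress
import re

FULL_IPV6 = re.compile(r"^(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}$")

def is_ipv6(s: str) -> bool:
    """Hypothetical helper wrapping ipaddress.IPv6Address."""
    try:
        ipaddress.IPv6Address(s)
    except ValueError:
        return False
    return True

samples = [
    "2001:0db8:85a3:0000:0000:8a2e:0370:7334",  # fully expanded
    "2001:db8:85a3::8a2e:370:7334",             # compressed with ::
    "::ffff:1.2.3.4",                           # IPv4-mapped
]

# Each value is (regex accepts, ipaddress accepts).
results = {s: (bool(FULL_IPV6.match(s)), is_ipv6(s)) for s in samples}
```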
URL (http / https)
r"^https?://[\w.-]+(?:\.[a-zA-Z]{2,})+(?:[/?#][^\s]*)?$"
What it does: Grabs http and https URLs that carry a real TLD, with the optional path, query and fragment hanging off the tail.
Example: https://example.com/path?id=42#section matches.
Edge case: It only knows http and https. Throw ftp or mailto or a data: URI at it and you get nothing back. It doesn’t understand user:pass@ userinfo or IDN domains either. Fine for a link checker. Wrong for anything that has to swallow arbitrary URLs.
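Alternative: urllib.parse.urlparse(s) almost never fails, it just parses. It only raises ValueError on a few malformed inputs, such as unbalanced IPv6 brackets in the host. So check the scheme and netloc it hands back and decide for yourself.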
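One way to do that scheme-and-netloc check, as a sketch. The function name is_http_url is hypothetical, and the policy (http and https only) is an assumption carried over from the regex.

```python
from urllib.parse import urlparse

def is_http_url(s: str) -> bool:
    """Hypothetical helper: accept only http(s) URLs that carry a host."""
    try:
        parts = urlparse(s)
    except ValueError:  # e.g. unbalanced IPv6 brackets in the host
        return False
    return parts.scheme in {"http", "https"} and bool(parts.netloc)

checks = {
    "https://example.com/path?id=42#section": is_http_url(
        "https://example.com/path?id=42#section"
    ),
    "https://user:pass@example.com:8443/x": is_http_url(
        "https://user:pass@example.com:8443/x"
    ),
    "ftp://example.com": is_http_url("ftp://example.com"),
    "mailto:alice@example.com": is_http_url("mailto:alice@example.com"),
}
```

Unlike the regex, this accepts userinfo and an explicit port, which may or may not be what you want.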
Phone number in E.164
r"^\+[1-9]\d{6,14}$"
What it does: Checks the E.164 shape. A plus sign, a first digit that can’t be zero (country codes never start with one), then 6 to 14 more digits, so 7 to 15 digits in total. That’s the format you want sitting in a database.
Example: +33612345678 matches.
Edge case: It checks the shape, not reality. +19999999999 sails right through, and no human could ever dial that. If “looks like a phone number” is enough for you, ship it. Just don’t go telling your users it’s verified, because it isn’t.
Alternative: When I actually need to trust the number, I reach for the phonenumbers package, the Python port of Google’s libphonenumber. It knows the per-country length and prefix rules the regex never could.
ISO 8601 datetime
r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$"
What it does: Matches the canonical ISO 8601 timestamp, fractional seconds and timezone offset both optional. The format APIs hand you all day long.
Example: 2026-05-27T14:30:00.123+02:00 matches.
Edge case: The regex counts digits, not calendars. So 2026-02-30 slides right through. February has opinions about that. The pattern does not share them. If a bad date can reach you, validate after the match, not instead of it.
Alternative: On 3.11+ I just call datetime.fromisoformat(s). It eats most ISO 8601 variants (the documented gaps include ordinal dates and reduced-precision forms like YYYY-MM) and rejects the dates that can’t exist. That’s what I’d do.
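“Validate after the match, not instead of it” could look something like this. The name parse_iso is hypothetical. The sample input below avoids the Z suffix, which fromisoformat only accepts from Python 3.11.

```python
import re
from datetime import datetime, timedelta

ISO_DT = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$"
)

def parse_iso(s: str):
    """Hypothetical helper: shape check first, calendar check second."""
    if not ISO_DT.match(s):
        return None
    try:
        return datetime.fromisoformat(s)
    except ValueError:  # right shape, impossible date or time
        return None

good = parse_iso("2026-05-27T14:30:00.123+02:00")
bad_calendar = parse_iso("2026-02-30T10:00:00")   # None: regex passes, calendar refuses
bad_shape = parse_iso("27/05/2026 14:30")         # None: regex refuses
```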
Date DD/MM/YYYY (European)
r"^(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}$"
What it does: Validates European DD/MM/YYYY. Keeps the day and month inside sane ranges instead of letting any two random digits through.
Example: 27/05/2026 matches; 32/05/2026 does not.
Edge case: Same trap as the ISO one. 31/02/2026 matches, because the pattern has no clue February stops at 28 or 29. If the calendar matters to you, finish the job with datetime.strptime(s, "%d/%m/%Y").
Alternative: datetime.strptime throws ValueError on a date that can’t exist. Which is exactly what you want for strict validation.
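The regex and strptime cover each other’s blind spots: strptime happily takes unpadded input like 1/5/2026, and the regex happily takes 31/02. A sketch that chains them, with parse_ddmmyyyy as a made-up name:

```python
import re
from datetime import date, datetime

DDMMYYYY = re.compile(r"^(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}$")

def parse_ddmmyyyy(s: str):
    """Hypothetical helper: strict shape via regex, real calendar via strptime."""
    if not DDMMYYYY.match(s):
        return None
    try:
        return datetime.strptime(s, "%d/%m/%Y").date()
    except ValueError:
        return None

ok = parse_ddmmyyyy("27/05/2026")         # date(2026, 5, 27)
impossible = parse_ddmmyyyy("31/02/2026") # None: strptime rejects it
unpadded = parse_ddmmyyyy("1/5/2026")     # None: the regex rejects it
```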
Time HH:MM:SS (24-hour)
r"^(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d$"
What it does: Validates a 24-hour clock. Hours 00-23, minutes and seconds capped at 00-59. No 25:00 nonsense gets past it.
Example: 14:30:00 matches; 25:00:00 does not.
Edge case: It rejects leap seconds (23:59:60), which a handful of timestamp standards actually allow. You’ll probably never hit one. But if you’re parsing data from a system that emits them, you’ll be quietly dropping valid rows and never know why.
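Alternative: datetime.time.fromisoformat when you want a real time object back instead of a yes or no. It rejects 23:59:60 as well, and it also accepts shorter forms like 14:30, so keep the regex in front if the exact HH:MM:SS shape matters.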
UUID v4 (random)
r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
What it does: Checks the v4 layout specifically. The version nibble has to be 4, and the variant nibble has to land on 8, 9, a or b. So it won’t wave through a random v1 UUID dressed up to look the part.
Example: 550e8400-e29b-41d4-a716-446655440000 matches.
Edge case: Lowercase only. Hand it an uppercase UUID and it just shrugs. This one bites me more than I’d like to admit. Add re.IGNORECASE or switch to [0-9a-fA-F] and move on with your day.
Alternative: uuid.UUID(s) doesn’t care about version or case. It just tells you whether it’s a UUID, full stop.
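If you want the stdlib route but still care that it’s a v4, something like this works. The helper name is_uuid4 is mine. Note that uuid.UUID is looser about format than the regex (it also accepts braces and hyphen-free hex), so pick whichever strictness you need.

```python
import re
import uuid

# Same pattern as the card, with IGNORECASE so uppercase input passes.
UUID4 = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def is_uuid4(s: str) -> bool:
    """Hypothetical helper: parse with the stdlib, then check the version."""
    try:
        parsed = uuid.UUID(s)
    except ValueError:
        return False
    return parsed.version == 4

lower = "550e8400-e29b-41d4-a716-446655440000"
v1_style = "123e4567-e89b-12d3-a456-426614174000"  # version nibble is 1

regex_upper = bool(UUID4.match(lower.upper()))  # True thanks to IGNORECASE
regex_v1 = bool(UUID4.match(v1_style))          # False
```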
Hex colour code
r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$"
What it does: Matches CSS hex colours in the lengths you actually paste. 3-digit shorthand (#f00), 6-digit (#ff0000), plus 8-digit RGBA (#ff0000ff).
Example: #1e293b matches.
Edge case: It skips the 4-digit shorthand RGBA (#f00f), which, honestly, I forget exists about half the time. If your design tokens lean on it, drop a {4} into the alternation and you’re covered.
Alternative: Modern CSS also lets you write named colours and function syntax like rgb() and oklch(). Regex won’t keep up with all that. If you need it covered, parse the CSS properly.
CSV row split with quoted fields
r'(?:^|,)("(?:[^"]|"")*"|[^,]*)'
What it does: Walks the fields of a comma-separated line and copes with the annoying part. Double-quoted fields that hide commas inside them, and the doubled "" escape on top.
Example: alice,"smith, jr.",42 yields three captures.
Edge case: Two gaps, and you’ll feel them fast. It can’t deal with a newline inside a quoted field, the kind where one CSV record spills across several lines. It also leaves the surrounding quotes stuck on the captures for you to strip yourself. Real CSV does both of those constantly.
Alternative: Look, just use csv.reader. It’s in the stdlib, it handles every one of these edge cases, and I only touch the regex when pulling in csv would feel like silly overkill.
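For comparison, here are both routes on the example row. The quote-stripping step is my own addition to show the cleanup the regex leaves for you.

```python
import csv
import re

FIELD = re.compile(r'(?:^|,)("(?:[^"]|"")*"|[^,]*)')

line = 'alice,"smith, jr.",42'

# The regex keeps the surrounding quotes on quoted fields.
regex_fields = FIELD.findall(line)   # ['alice', '"smith, jr."', '42']

# Strip the outer quotes and undo the doubled-quote escape by hand.
cleaned = [
    f[1:-1].replace('""', '"') if f.startswith('"') else f
    for f in regex_fields
]

# csv.reader does all of that in one step.
csv_fields = next(csv.reader([line]))   # ['alice', 'smith, jr.', '42']
```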
Apache or nginx common log entry
r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\S+)'
What it does: Pulls the seven fields out of a common-log line: client IP, timestamp, method, path, protocol, status, bytes sent. This is the one I’ve pasted into more throwaway scripts than anything else on this list.
Example: 127.0.0.1 - - [27/May/2026:14:30:00 +0200] "GET /index.html HTTP/1.1" 200 1234 matches.
Edge case: It stops at the common format. So the combined-format extras, referer and user-agent, just fall off the end. Most servers I touch log combined anyway, so I tack a space plus "([^"]*)" "([^"]*)" onto the end and grab those two as well.
Alternative: Fine for a one-off grep. But once you’re parsing logs at any real volume, ship them to something like Loki with a proper parser pipeline. A hand-rolled regex per server doesn’t scale, and you’ll resent maintaining it within a month.
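In a throwaway script I’d probably wrap it like this. The field names in FIELDS and the parse_line helper are my own labels for the seven groups.

```python
import re

LOG = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\S+)')

# Hypothetical labels for the seven capture groups, in order.
FIELDS = ("ip", "timestamp", "method", "path", "protocol", "status", "size")

def parse_line(line: str):
    """Return a dict of the seven fields, or None if the line doesn't match."""
    m = LOG.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None

line = '127.0.0.1 - - [27/May/2026:14:30:00 +0200] "GET /index.html HTTP/1.1" 200 1234'
entry = parse_line(line)
```

Everything comes back as a string, so convert status and size yourself. Size can be a bare - when no body was sent, which is why that group is \S+ and not \d+.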
Whitespace collapse
re.sub(r"\s+", " ", text).strip()
What it does: Squashes every run of whitespace, tabs and newlines included, down to one space, then trims the ends. My go-to one-liner for cleaning up OCR output or scraped HTML. Also whatever a user just pasted into a box.
Example: " Hello\t\n world " becomes "Hello world".
Edge case: Run it on raw HTML and it’ll cheerfully flatten the whitespace inside your <pre> blocks too, which mangles your code samples. Point it at text content only. Never the markup.
Alternative: " ".join(text.split()) does the exact same thing with no regex at all, and it’s a touch faster on short strings. Half the time that’s what I actually reach for.
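Both versions side by side, as a quick sanity check. The function names are just labels for this sketch.

```python
import re

def collapse_re(text: str) -> str:
    """Regex version from the card."""
    return re.sub(r"\s+", " ", text).strip()

def collapse_split(text: str) -> str:
    """No-regex version: split on any whitespace run, rejoin with one space."""
    return " ".join(text.split())

sample = "  Hello\t\n  world  "
via_regex = collapse_re(sample)
via_split = collapse_split(sample)
```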
Markdown link extraction
r"\[([^\]]+)\]\(([^)]+)\)"
What it does: Splits a Markdown link into its two useful halves. The visible label in group 1, the URL in group 2, out of something like [label](https://example.com).
Example: See [the docs](https://docs.example.com) yields label “the docs” and URL “https://docs.example.com”.
Edge case: It breaks the second a bracket shows up in the label or a paren shows up in the URL. And Wikipedia URLs are absolutely riddled with parentheses, so this fails way more often than you’d guess. It can’t count nested brackets. Stdlib re has no recursion, so that’s not a flaw you can patch out.
Alternative: For Markdown you didn’t write yourself, hand the job to a real parser like mistune or markdown-it-py and stop guessing.
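Here’s the happy path and the parenthesis failure in one go. The sample strings are made up for the demonstration.

```python
import re

MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

text = "See [the docs](https://docs.example.com) and [the FAQ](https://example.com/faq)."
links = MD_LINK.findall(text)   # list of (label, url) tuples

# A URL containing parentheses gets cut at the first ")".
wiki = "[Python](https://en.wikipedia.org/wiki/Python_(programming_language))"
truncated = MD_LINK.search(wiki).group(2)
```

The truncated URL is still a plausible-looking string, so nothing downstream complains until somebody clicks it.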
JSON number
r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"
What it does: Matches a number the way the JSON spec actually defines it. Optional sign, no sneaky leading zero, then an optional decimal and exponent. It mirrors the grammar, which is the entire point.
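Example: With re.fullmatch, -1.23e-4 matches and 007 does not. The pattern carries no anchors, so re.search would still find the leading 0 inside 007.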
Edge case: It won’t match Infinity or NaN or a hex literal. That’s deliberate, because none of those are legal JSON to begin with. If your “JSON” is carrying them around, something upstream is already lying to you.
Alternative: When it’s a whole document instead of a stray token, skip the regex and let json.loads(s) do the parsing.
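Because the pattern is unanchored, how you call it matters. A small sketch, with is_json_number as a hypothetical helper:

```python
import re

JSON_NUMBER = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")

def is_json_number(token: str) -> bool:
    """Hypothetical helper: the whole token must be one JSON number."""
    return JSON_NUMBER.fullmatch(token) is not None

whole = is_json_number("-1.23e-4")            # True
leading_zero = is_json_number("007")          # False
partial = JSON_NUMBER.search("007").group()   # '0': search finds a number inside
```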
Python identifier (ASCII-only)
r"^[A-Za-z_][A-Za-z0-9_]*$"
What it does: Checks the classic ASCII-only variable name. Starts with a letter or underscore, then letters or digits or underscores the rest of the way down.
Example: my_var2 matches.
Edge case: Modern Python is perfectly happy with Unicode names like élève or π, and this pattern slams the door on every single one. So if you’re accepting Unicode, this regex will reject code that runs absolutely fine. Annoying.
Alternative: Skip the regex entirely and call "name".isidentifier(). It’s the official check, and it already knows the Unicode rules cold.
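One wrinkle worth showing: isidentifier() and the regex both say yes to reserved words like class. If you’re checking names a user can actually assign to, pair it with keyword.iskeyword. The helper name below is made up.

```python
import keyword
import re

ASCII_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_usable_name(name: str) -> bool:
    """Hypothetical helper: a valid identifier that isn't a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

regex_unicode = bool(ASCII_IDENT.match("élève"))   # False: ASCII-only pattern
regex_keyword = bool(ASCII_IDENT.match("class"))   # True: the regex can't tell
```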
Unix file path
r"^/(?:[^/\x00]+/)*[^/\x00]*$"
What it does: Matches an absolute Unix path. Forward slashes, and no null bytes hiding in the segments.
Example: /home/user/file.txt matches.
Edge case: Read this one twice if you care about security. It happily accepts .. and . segments, which is exactly how a path-traversal attack climbs out of the directory you meant to confine it to. “It matched my path regex” is not a security check, full stop. Run the thing through pathlib.Path.resolve() and confirm where it really lands before you trust a byte of it.
Alternative: pathlib.PurePosixPath(s) parses the string without ever touching the filesystem.
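A sketch of a containment check that never touches the disk. The base directory and the stays_inside name are assumptions for the example. It’s purely lexical, so it knows nothing about symlinks; for those you still need Path.resolve() on the real filesystem.

```python
import posixpath
from pathlib import PurePosixPath

BASE = PurePosixPath("/srv/uploads")  # hypothetical directory to confine paths to

def stays_inside(base: PurePosixPath, candidate: str) -> bool:
    """Collapse . and .. lexically, then check the result is under base."""
    normalised = PurePosixPath(posixpath.normpath(candidate))
    return normalised == base or base in normalised.parents

inside = stays_inside(BASE, "/srv/uploads/a/b.txt")              # True
dotted = stays_inside(BASE, "/srv/uploads/./x/../y")             # True
escaped = stays_inside(BASE, "/srv/uploads/../../etc/passwd")    # False
```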
Windows file path
r'^[A-Za-z]:\\(?:[^\\/:*?"<>|]+\\)*[^\\/:*?"<>|]*$'
What it does: Matches an absolute Windows path. Drive letter, backslash separators, and it throws out the characters Windows forbids in a filename.
Example: C:\Users\admin\file.txt matches.
Edge case: Windows paths are a swamp. This pattern misses UNC shares (\\server\share) and the long-path prefix (\\?\). It also ignores the fact that Windows quietly accepts forward slashes. I’ve been burned by all three, separately, on bad days. Cover only the cases you’re sure you’ll actually see.
Alternative: pathlib.PureWindowsPath(s) parses without going anywhere near the disk, and it knows the quirks this regex doesn’t.
Strong password
r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{12,}$"
What it does: Enforces the usual composition rules with four lookaheads. At least 12 characters, and one each of lowercase, uppercase, a digit, plus a symbol.
Example: P@ssw0rd-StrongEnough matches.
Edge case: Here’s the uncomfortable truth, and I might be in the minority on how loudly to say it. Ticking these boxes has almost nothing to do with whether a password is actually strong. P@ssw0rd1234 passes every rule here and sits in basically every breach dump on the internet. Composition rules measure how obedient a password is. Entropy is a separate question. Always pair this with a check against a known leaked-password list.
Alternative: zxcvbn-python actually estimates how guessable a password is, and it spots the dictionary words and keyboard walks a regex will never see.
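To make the pairing concrete, here’s a toy version. LEAKED is a one-entry stand-in for a real leaked-password list, and acceptable is a hypothetical name.

```python
import re

STRONG = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{12,}$")

# Stand-in for a real breach list; in practice this would be a large external dataset.
LEAKED = {"P@ssw0rd1234"}

def acceptable(password: str) -> bool:
    """Composition rules AND not a known leaked password."""
    return bool(STRONG.match(password)) and password not in LEAKED

obedient_but_leaked = bool(STRONG.match("P@ssw0rd1234"))   # True: passes the regex
rejected = acceptable("P@ssw0rd1234")                      # False: on the list
accepted = acceptable("P@ssw0rd-StrongEnough")             # True
```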
URL-friendly slug
r"^[a-z0-9]+(?:-[a-z0-9]+)*$"
What it does: Validates a clean lowercase slug. Alphanumeric groups joined by single hyphens, with no hyphen sitting at the start or the end, and none doubled up through the middle.
Example: my-awesome-post-2026 matches; --bad-- does not.
Edge case: It validates a slug. It doesn’t make one. Feed it "Café Brûlé" and it just says no, because the capitals, the space and the accents all stop it cold. You’ve got to decompose with unicodedata.normalize("NFKD", ...) and drop the combining marks to strip the diacritics, then lowercase, hyphenate and validate the result.
Alternative: For actually generating slugs I don’t reinvent any of this. python-slugify handles the normalising and transliteration and length-capping in one single call.
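If pulling in a dependency isn’t an option, a bare-bones normaliser might look like this. It’s a sketch, not a python-slugify replacement: the slugify name is mine, and dropping everything non-ASCII is a blunt assumption that loses scripts with no ASCII decomposition.

```python
import re
import unicodedata

SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def slugify(text: str) -> str:
    """Hypothetical helper: strip diacritics, lowercase, hyphenate."""
    decomposed = unicodedata.normalize("NFKD", text)            # é -> e + combining accent
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

slug = slugify("Café Brûlé")
valid = bool(SLUG.match(slug))
```

An input with nothing ASCII-representable comes back as an empty string, which the validator then rejects.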
Test your regex live?
Our Regex Tester runs Python-flavour patterns against your sample input. It highlights the capture groups and previews the replacement as you type. Nothing to install. Paste and go.
Patterns to avoid in 2026
A few “classic” regexes keep getting copy-pasted out of decade-old StackOverflow answers, and every one has burned me at least once. Take the “perfect email” regex, that 1,000-character monster trying to be RFC 5322 on a single line. It rejects real addresses that deliver mail just fine. Use a short pragmatic pattern, then send a confirmation email and be done. The inbox is the only validator that actually counts. Then there’s the “URL that matches everything”, with every scheme and userinfo and IDN bolted on. It’s brittle, and urllib.parse.urlparse beats it on both speed and correctness anyway. And the “credit-card validator” that sniffs out the card brands by their number prefixes? That one’s worse than useless. It hands you a warm feeling of security you didn’t earn. The Luhn check tells you a number is well-formed. Only the bank tells you it’s real, and a bare regex can’t even manage the first part.
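For reference, the Luhn check mentioned above fits in a few lines of plain Python. This is a sketch with a hypothetical luhn_ok name. It tells you the digits are well-formed and nothing more.

```python
def luhn_ok(number: str) -> bool:
    """Return True when the digits pass the Luhn checksum."""
    stripped = number.replace(" ", "").replace("-", "")
    if len(stripped) < 2 or not (stripped.isascii() and stripped.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(stripped)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

well_formed = luhn_ok("4111 1111 1111 1111")   # a widely used test number
typo = luhn_ok("4111 1111 1111 1112")
```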
Performance notes for high-volume regex
Compile once, reuse forever. That’s the one habit that pays off every single time. Put re.compile(pattern) up at module scope and call PATTERN.match(s) inside your hot loop. Yes, re caches compiled patterns internally already. But leaning on that cache in a tight loop still measurably loses to a plain module-level constant, and the intent reads clearer anyway. Once you’re into millions of matches per second, backtracking is what hurts. Possessive quantifiers and atomic groups shut down catastrophic backtracking before it can hang the whole process. Stdlib re has had both since Python 3.11, and the regex PyPI package brings them to older interpreters. And if you’ve got a fixed set of patterns and don’t need named captures, the hyperscan bindings can hand you roughly 100x. I save that one for when the profiler actually points there. Not a moment before, because it’s a real dependency to carry.
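The compile-once habit, sketched with the slug pattern from earlier. The function name valid_slugs is illustrative.

```python
import re

# Compiled once at import time, reused on every call.
SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def valid_slugs(candidates):
    """Filter an iterable of strings down to the valid slugs."""
    match = SLUG.match  # bind the method once, outside the loop
    return [c for c in candidates if match(c)]

kept = valid_slugs(["my-post", "--bad--", "ok2", "Not-Lowercase"])
```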
Frequently asked questions
Why use raw strings (r"...") for every pattern?
Because backslashes mean something to both Python and the regex engine, and that double meaning is exactly where the bugs hide. Without the r, Python doesn’t recognise "\d" as an escape, so it leaves it as a literal backslash-d. It still works, sure, but since 3.12 it throws a SyntaxWarning, and now you’re squinting at it. And worse, "\b" turns into a backspace character before the regex even gets a look, so your word boundary silently never matches. The raw prefix makes your pattern read just like the docs you lifted it from. I put r on every pattern. No exceptions, and there’s genuinely no downside I’ve ever found.
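The nastiest version of this is \b, because Python treats it as a valid escape and says nothing. A tiny demonstration, with made-up variable names:

```python
import re

cooked = "\bcat\b"    # Python converts each \b into a backspace character (0x08)
raw = r"\bcat\b"      # the regex engine sees \b and reads it as a word boundary

cooked_hit = re.search(cooked, "a cat sat")   # None: there is no backspace in the text
raw_hit = re.search(raw, "a cat sat")         # matches 'cat'

lengths = (len(cooked), len(raw))             # (5, 7)
```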
When is regex the wrong tool?
Anytime the thing nests. HTML, JSON, anything with a recursive grammar, regex literally can’t parse it correctly. And that’s not me being fussy or purist about it. It’s the actual math behind that famous “you can’t parse HTML with regex” rant. Dates and paths and UUIDs already have a library each (datetime, pathlib, uuid) that’s stronger and reads better than whatever pattern you’d cook up. Where regex earns its keep is the flat, well-defined stuff. Stay in that lane and it’s a pleasure to use. Outside it, you’ve signed up for a rough few days.
How do I avoid catastrophic backtracking?
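Non-capturing groups (?:...) instead of (...) are free when you don’t need the capture, so why not, but they don’t change how the engine backtracks. What blows up is a quantifier nested inside another quantifier, like (a+)+, or alternation branches that overlap. Possessive quantifiers (*+, ++) and atomic groups (?>...), in stdlib re since Python 3.11 and in the regex module before that, kill backtracking on the greedy stretches that tend to blow up. Then I attack my own pattern on purpose. Feed it something pathological, like the letter a repeated thirty-odd times followed by one character that can’t match, and just watch. If it hangs on that toy input, it’ll hang on a hostile request in prod, and at least now you found out first. Refactor until it doesn’t.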
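Here’s a small harness for that self-attack, using the textbook nested-quantifier pattern. The names RISKY, SAFE and timed are mine. I’ve kept the input at 18 characters so it finishes quickly; each extra a roughly doubles the work for the risky pattern, so raise it with care.

```python
import re
import time

RISKY = re.compile(r"^(a+)+$")   # quantifier nested in a quantifier
SAFE = re.compile(r"^a+$")       # same language, no nesting

def timed(pattern, text):
    """Return (matched, seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

hostile = "a" * 18 + "!"   # almost matches, then fails at the very end

risky_matched, risky_seconds = timed(RISKY, hostile)
safe_matched, safe_seconds = timed(SAFE, hostile)
```

Both patterns reject the input. The difference is only in how long they take to say so, which is exactly what you’re looking for when you print the two timings.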
Does Python regex support look-behind?
Yes, with a limit. Fixed-width look-behind (?<=...) has been around forever, but stdlib re still rejects variable-width look-behind with “look-behind requires fixed-width pattern”. If you need that, the regex module on PyPI supports it. Look-aheads (?=...) have always worked, at any width. One piece of advice though, and maybe it’s just me being conservative: go easy on them. The day I catch myself stacking three look-arounds into a single pattern is usually the day I should’ve split it into two plain passes I can still read six months from now.
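A quick check of where stdlib re draws the line on look-behind. The price-style sample text is invented for the example.

```python
import re

# Fixed-width look-behind: digits preceded by a dollar sign.
price = re.search(r"(?<=\$)\d+", "total: $42").group()   # '42'

# Variable-width look-behind (\s* can be any length) fails at compile time.
try:
    re.compile(r"(?<=\$\s*)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```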
What about Unicode classes like \p{L}?
The stdlib re doesn’t do \p{...} property classes at all. Try it and you’ll just get an error. The regex module on PyPI does, and it’s a drop-in replacement, so you swap the import and barely notice the difference. My rule of thumb runs like this. If I’m only validating ASCII, plain re is fine and I keep the dependency count low. The moment real Unicode is on the table, I install regex and quit fighting the problem.
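You can see the error for yourself, and there’s a stdlib-only approximation for “any letter” if you can’t add the dependency. The character class below is a well-known trick, not an exact equivalent of \p{L}: it is “word characters minus digits and underscore”, so a few numeric symbols can slip through.

```python
import re

# On the Python versions I've used, stdlib re rejects \p{...} as a bad escape.
try:
    re.compile(r"\p{L}+")
    has_property_classes = True
except re.error:
    has_property_classes = False

# Approximation: \w without digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")

unicode_word = LETTERS.fullmatch("élève") is not None    # True
with_digits = LETTERS.fullmatch("abc123") is not None    # False
```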
Where can I download this cheatsheet?
Easiest path? Just print the page to PDF straight from your browser. Ctrl+P on Windows, Cmd+P on macOS, then “Save as PDF”. I built the layout to print clean, with no banner ads chewing up the margins. An exportable JSON of all the patterns is something I’d like to add eventually, honestly, but the browser print is what I’d reach for today.













