Sunday, April 03, 2022

Easier Regexps

Python, although a scripting language, does not bake regular expressions into the language syntax as other scripting languages such as AWK, Perl and Ruby do. This can make people pause before using the re library that Python supplies, but it has its place, and regular expressions are a handy tool.

Over the decades, I have refined my use of regular expressions and want to pass on a style, and some tools, that I find very useful.

1. Use Regex101.com!

  • Open the site in another tab.
  • Reading down the main textual menu to the left, select the Python language.

2. Make regexps readable.

  • In the "REGULAR EXPRESSIONS" section, top centre:
    • Hover over the r" text to the left of the entry box and it will say "Change delimiter". Click and change the delimiter to r""".
    • Hover over the """gm text to the left of the central window which should then say "Set Regex Options". Select the g (global), m (multi line) and x (extended) options.

The above allows you to enter a regexp over multiple lines and with comments. Whitespace in the pattern is, on the whole, ignored, so you should use the pattern \s or \n etc. to represent whatever whitespace you wish to match.
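In Python the x option corresponds to re.VERBOSE. The following is a minimal sketch of the whitespace behaviour described above. The sample string and variable names are made up for illustration.

```python
import re

# In verbose mode, literal spaces and newlines in the pattern are ignored,
# so whitespace that must be matched is written explicitly (\s, \n, ...).
pattern = re.compile(r"""
    (\w+)      # first word
    \s+        # the gap between the words has to be spelled out
    (\w+)      # second word
""", re.VERBOSE)

words = pattern.fullmatch("hello   world").groups()
print(words)

# The same layout without \s+ cannot bridge the gap, so there is no match.
no_gap = re.compile(r"""
    (\w+)
    (\w+)
""", re.VERBOSE)
print(no_gap.fullmatch("hello   world"))
```

The second pattern shows why the indentation and line breaks are safe: they contribute nothing to what is matched.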

3. Get a representative selection of what you wish to match.

Know your data! Read it, and select sections that convey all the intricacies that your regexp needs to match. Take care when combining several excerpts from, say, a log, that the excerpts together are still valid.

  • Paste your sample text in the main "TEST STRING" section.

4. Use named capture groups.

When creating your expression in the top centre section, any capture-group should be a named capture-group.

  • Use uppercase group names - this usually contrasts with other parts of the regexp, aiding readability.
  • Use short, meaningful group names.

These names might become lowercased variable names in a program.
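As a rough sketch of that naming style, assuming an invented log line (the DATE, LEVEL and MSG group names are only for illustration):

```python
import re

# Hypothetical log line, invented for this example.
line = "2022-04-03 ERROR disk full"

regex = r"""
    (?P<DATE>  \d{4}-\d{2}-\d{2} )  \s+
    (?P<LEVEL> [A-Z]+ )             \s+
    (?P<MSG>   .* )
"""

m = re.match(regex, line, re.VERBOSE)

# The uppercase group names become lowercase variable names.
date, level, msg = m['DATE'], m['LEVEL'], m['MSG']
print(date, level, msg)
```

The uppercase names stand out against the lowercase escapes such as \d and \s, and the mapping from group to variable is obvious at a glance.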

5. You can comment in the regexp!!!

Yes you can. With the x option set, anything from an unescaped # to the end of the line is a comment.
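One thing to watch: because # starts a comment in this mode, a literal # in the data has to be escaped. A small sketch, with a made-up test string:

```python
import re

regex = r"""
    \#                           # a literal hash: escaped, else it would start a comment
    (?P<HEX> [0-9a-fA-F]{6} )    # six hex digits
"""

# Hypothetical sample text for illustration.
found = [m['HEX'] for m in re.finditer(regex, "#ff0000 and #00ff00", re.VERBOSE)]
print(found)
```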

6. Debug on site.

Regex101 explains the regexp and all the matches of the regexp on the test string as you hover on parts of the regexp or test string. There is extra info in the right hand side sections too.

7. Export the code.

  • Under the lower LHS TOOLS menu select Code Generator.

Cut-n-paste the code into your favourite editor.

I tend to take only parts of the generated code: the regexp, the test_str and the re compile options are useful.
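What is left after trimming might look something like the sketch below. The key=value pattern and sample text are hypothetical stand-ins, not regex101's actual output. Python has no g flag, so that option shows up as a call to re.finditer rather than as a compile option.

```python
import re

# The three pieces worth keeping: the pattern, the sample text and the flags.
regex = r"""
    (?P<KEY> \w+ ) = (?P<VALUE> \d+ )    # hypothetical key=value pattern
"""
test_str = "width=80 height=24"          # hypothetical sample text
flags = re.MULTILINE | re.VERBOSE        # the m and x options

# finditer returns every match, which is what the g option does on the site.
pairs = {m['KEY']: int(m['VALUE']) for m in re.finditer(regex, test_str, flags)}
print(pairs)
```

Keeping test_str alongside the regexp gives a ready-made check that the pattern still works after later edits.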

An Example: Harvesting colours.

I was looking for a list of distinctive colours and found this page: List of 20 Simple, Distinct Colors

If you click on the convenient button, the colours adjust to a nice order, but selecting a region around the colour table and pasting into an editor gives one long concatenation of the table information amongst surrounding text.

I wanted something more readable, so I pasted the long string into regex101 here and created the regular expression to parse it. (Oh, that's another feature - you can create a regex101 account and save regexps).

The final script creates a nicely formatted Python list of the data:

# -*- coding: utf-8 -*-
"""
Parse Table grab from https://sashamaps.net/docs/resources/20-colors/

Created on Sat Apr  2 19:36:07 2022

@author: paddy
"""

_initial_screen_cut = """
Accessibility: Red#e6194B1Green#3cb44b2Yellow#ffe1193Blue#4363d84Orange#f582315Purple#911eb46Cyan#42d4f47Magenta#f032e68Lime#bfef459Pink#fabed410Teal#46999011Lavender#dcbeff12Brown#9A632413Beige#fffac814Maroon#80000015Mint#aaffc316Olive#80800017Apricot#ffd8b118Navy#00007519Grey#a9a9a920White#ffffff21Black#00000022
'#e6194B', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#42d4f4', '#f032e6', '#bfef45', '#fabed4', '#469990', '#dcbeff', '#9A6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#a9a9a9', '#ffffff', '#000000'
Test:

"""
import re

regex = r"""
	# Parse Table grab from https://sashamaps.net/docs/resources/20-colors/
	(?:
	  (?P<NAME>[A-Za-z]+)
	  \# (?P<HEX>[0-9a-fA-F]{6})
	  (?P<ID>\d+)
	)
"""

test_str = ("Red#e6194B1Green#3cb44b2Yellow#ffe1193Blue#4363d84Orange#f582315Pu"
            "rple#911eb46Cyan#42d4f47Magenta#f032e68Lime#bfef459Pink#fabed410Te"
            "al#46999011Lavender#dcbeff12Brown#9A632413Beige#fffac814Maroon#800"
            "00015Mint#aaffc316Olive#80800017Apricot#ffd8b118Navy#00007519Grey#"
            "a9a9a920White#ffffff21Black#00000022")

# Each match yields (ID, NAME, HEX as an int); finditer finds every match, like the g option.
distinct = [(int(match['ID']), match['NAME'], int(match['HEX'], 16))
            for match in re.finditer(regex, test_str, re.MULTILINE | re.VERBOSE)]

#print(f"{distinct=}")

print('\ndistinct_colours = [\n# NAME          HEX_CODE')
for _id, name, num in distinct:  # _id is unused here; renamed to avoid shadowing the builtin id
    print(f" ({repr(name)+',':<12}  0x{num:06x}),")
print(']')

The output:

distinct_colours = [
# NAME          HEX_CODE
 ('Red',        0xe6194b),
 ('Green',      0x3cb44b),
 ('Yellow',     0xffe119),
 ('Blue',       0x4363d8),
 ('Orange',     0xf58231),
 ('Purple',     0x911eb4),
 ('Cyan',       0x42d4f4),
 ('Magenta',    0xf032e6),
 ('Lime',       0xbfef45),
 ('Pink',       0xfabed4),
 ('Teal',       0x469990),
 ('Lavender',   0xdcbeff),
 ('Brown',      0x9a6324),
 ('Beige',      0xfffac8),
 ('Maroon',     0x800000),
 ('Mint',       0xaaffc3),
 ('Olive',      0x808000),
 ('Apricot',    0xffd8b1),
 ('Navy',       0x000075),
 ('Grey',       0xa9a9a9),
 ('White',      0xffffff),
 ('Black',      0x000000),
]

 

END.