Duncan's Diurnal Diatribe: SplitColumns in Power Query

Duncan Williamson

May 2026

Introduction

I have downloaded 2400 rows of data that all contain the same dirty data problem and this page tells you what the problem is and how I solved it

with ChatGPT

with REGEX() formulas

with Claude

The page discusses the Claude solution and it begins with my opening Prompt:

I would like your help with cleaning the data in column B of this worksheet. I downloaded the table from a web site and column B should include the name of the company and its objectives. What they did on the website is to put both bits of information in the same cell of the table. You task is to split out name from Objectives> The following will help you. Usually, the name ends with a lower case letter and the Objective starts with an upper case letter and there is not space or other delimiter between them. However, in some cases the split is number to upper case or . to uppercase or some other character.

Claude in Excel

I'll take a look at column B to understand the data, then split it.

Read data

Let me check the total row count and sample more variations.

Check used row count

I'll process all 2400 data rows in JavaScript using a regex-based split, then write Name into column B and Objective into a new column C (shifting the rest). Let me first check what's in columns C-G so I know what I'm shifting.

Read data ... I'll insert a new column C for "Objective", then split B into B (Name) and C (Objective). The split rule: find the boundary where a lowercase letter, digit, or punctuation (., ), ], &) is followed by an uppercase letter.

Split column B into Company and Objective

Done. Split 2,400 rows —

company name now in B:B

objective in new column C:C

The other columns (Location, Year end, etc.) shifted right.

Spot check a few edge cases to confirm the split looks right:

Read data ... Found one issue: "MacGregor Services" got split as "Mac" / "Gregor Services…" because of the internal capital. The non-greedy regex grabbed the earliest lowercase→uppercase boundary. I need to prefer the last such boundary instead, since the objective always starts further right (it's a sentence, not one word).

Let me fix the regex to be greedy and rerun. I'll also re read the original column B from the source — but wait, I've already overwritten it. Let me check whether the original is recoverable… it isn't. I need to reconstruct by joining B+C back together, then re split with a smarter rule.

Re split using greedy boundary

Fixed: MacGregor Services now in tact, and 214 rows total were corrected. Let me sanity check a broader sample to make sure the greedy approach didn't introduce its own problems:

Read data ... Row 49 didn't split: "Hancock Group – HMGDealerships for Ford cars and trucks" stayed in column B. Reason: my greedy regex requires a lowercase letter immediately before the uppercase. Here it's "G" (uppercase) → "D" (uppercase), so the lowercase pattern fails, and the digit/punct fallbacks don't apply either. The boundary is uppercase→uppercase after a token like "HMG".

Let me also check for other unsplit rows across the full dataset:

Find rows with empty Objective

99 rows still unsplit: all are uppercase acronym boundaries (e.g., "MBI MBO|Hotelliers", "Focus DIY|DIY retailer"). I need a smarter rule: split where 2+ uppercase letters are followed by a single uppercase letter then lowercase letters (the start of a normal word like "Hotelliers"). That captures …HMG|Dealerships, …MBO|Hotelliers, …DIY|DIY retailer, etc.

Let me apply this as a second pass to only the unsplit rows:

· Second-pass split on acronym boundaries

· Down to 9 stragglers. Looking at them:

o Row 78: (Crazy Eddie's)*Distributers... — split should be after *

o Row 538: Focus DIYDIY retailer — DIY|DIY (uppercase→uppercase, no lowercase follows immediately)

o Row 571: Food Brokers — no objective at all, leave as is

o Row 732: Veritas ITIT recruitment agency — IT|IT

o Row 769: RAPIDIT component distributor — RAPID|IT

o Row 1788: LA Micro UKIT reseller — UK|IT

o Rows 1897, 1943: High Street TVTV shopping channel — TV|TV

o Row 1990: Rosa's Thai CaféThai restaurant chain — é (lowercase accented) → T, my regex [a-z] missed the accented char

Let me handle these with targeted patterns: * boundary, accented letter→uppercase, and the all uppercase→uppercase case:

All 2,400 rows split.

Summary:

B = Company name, C = Objective (new column inserted; D–H shifted right)
First pass (greedy lowercase→uppercase) handled ~2,300 rows
Second pass (uppercase acronym → Uppercase+lowercase word) caught 90 more like "MBI MBO|Hotelliers", "Hancock Group – HMG|Dealerships"
9 edge cases (accented letters, * delimiter, all-caps adjacencies like "DIY|DIY", "TV|TV") fixed manually

One row, "Food Brokers" at B571, had no objective in the source data: left as company name only.

Time taken: approx. one minute

Remove unwanted characters from values column

Prompt: And now column G: I am sure you can see and cure the problem without my guidance: leave all data in column G, however

Read data ... I see values like "†21,217" have a leading dagger character that turns the number into text. Let me scan column G to confirm and find any other non-numeric prefixes:

Find non-numeric prefixes in G

Two prefix markers: † (83 rows) and * (599 rows). You said "leave all data in column G" — I'll move the markers into a new column H ("Note") so G becomes clean numbers, and the marker info is preserved.

Strip markers from G into new column H

Done.

G:G is now clean numeric sales (formatted with thousands separator), and the † / * markers moved into new column

H:H "Sales note" — 83 daggers and 599 asterisks preserved. Columns to the right shifted accordingly.

Prompt: You did not need to preserve the markers as they are irrelevant to this file, please remove them

Delete column H

Removed column H.

I appreciate your conservative approach but it is appropriate in this case

Noted: I'll be less conservative about discarding clearly irrelevant artifacts going forward.

Time taken: Less than one minute