Introduction
I have downloaded 2400 rows of data that all contain the same dirty data problem and this page tells you what the problem is and how I solved it
with ChatGPT
with REGEX() formulas
with Claude
The page discusses the Claude solution and it begins with my opening Prompt:
I
would like your help with cleaning the data in column B of this worksheet. I
downloaded the table from a web site and column B should include the name of
the company and its objectives. What they did on the website is to put both
bits of information in the same cell of the table. You task is to split out
name from Objectives> The following will help you. Usually, the name ends
with a lower case letter and the Objective starts with an upper case letter and
there is not space or other delimiter between them. However, in some cases the
split is number to upper case or . to uppercase or some other character.
Claude in Excel
I'll
take a look at column B to understand the data, then split it.
Read
data
Let
me check the total row count and sample more variations.
Check used row count
I'll
process all 2400 data rows in JavaScript using a regex-based split, then write
Name into column B and Objective into a new column C (shifting the rest). Let
me first check what's in columns C-G so I know what I'm shifting.
Read data ... I'll insert a new column C for "Objective", then split B into B (Name) and C (Objective). The split rule: find the boundary where a lowercase letter, digit, or punctuation (., ), ], &) is followed by an uppercase letter.
Split column B into Company and Objective
Done.
Split 2,400 rows —
company
name now in B:B
objective
in new column C:C
The
other columns (Location, Year end, etc.) shifted right.
Spot
check a few edge cases to confirm the split looks right:
Read data ... Found one issue: "MacGregor Services" got split as "Mac" / "Gregor Services…" because of the internal capital. The non-greedy regex grabbed the earliest lowercase→uppercase boundary. I need to prefer the last such boundary instead, since the objective always starts further right (it's a sentence, not one word).
Let
me fix the regex to be greedy and rerun. I'll also re read the original column
B from the source — but wait, I've already overwritten it. Let me check whether
the original is recoverable… it isn't. I need to reconstruct by joining B+C
back together, then re split with a smarter rule.
Re split using greedy boundary
Fixed: MacGregor Services now in tact, and 214 rows total were
corrected. Let me sanity check a broader sample to make sure the greedy
approach didn't introduce its own problems:
Read data ... Row 49 didn't split: "Hancock Group – HMGDealerships for Ford cars and trucks" stayed in column B. Reason: my greedy regex requires a lowercase letter immediately before the uppercase. Here it's "G" (uppercase) → "D" (uppercase), so the lowercase pattern fails, and the digit/punct fallbacks don't apply either. The boundary is uppercase→uppercase after a token like "HMG".
Let
me also check for other unsplit rows across the full dataset:
Find rows with empty Objective
99 rows still unsplit: all are uppercase acronym boundaries
(e.g., "MBI MBO|Hotelliers", "Focus DIY|DIY retailer"). I
need a smarter rule: split where 2+ uppercase letters are followed by a single
uppercase letter then lowercase letters (the start of a normal word like "Hotelliers").
That captures …HMG|Dealerships, …MBO|Hotelliers, …DIY|DIY retailer, etc.
Let me apply this as a second pass to only the unsplit rows:
·
Second-pass split on acronym boundaries
·
Down to 9 stragglers. Looking at them:
o Row 78: (Crazy
Eddie's)*Distributers... — split should be after *
o Row 538: Focus DIYDIY
retailer — DIY|DIY (uppercase→uppercase, no lowercase follows immediately)
o Row 571: Food Brokers —
no objective at all, leave as is
o Row 732: Veritas ITIT
recruitment agency — IT|IT
o Row 769: RAPIDIT
component distributor — RAPID|IT
o Row 1788: LA Micro UKIT
reseller — UK|IT
o Rows 1897, 1943: High
Street TVTV shopping channel — TV|TV
o Row 1990: Rosa's Thai
CaféThai restaurant chain — é (lowercase accented) → T, my regex [a-z] missed
the accented char
Let
me handle these with targeted patterns: * boundary, accented letter→uppercase,
and the all uppercase→uppercase case:
All 2,400 rows split.
Summary:
- B =
Company name, C =
Objective (new column inserted; D–H shifted right)
- First
pass (greedy lowercase→uppercase) handled ~2,300 rows
- Second
pass (uppercase acronym → Uppercase+lowercase word) caught 90 more like
"MBI MBO|Hotelliers", "Hancock Group –
HMG|Dealerships"
- 9 edge
cases (accented letters, * delimiter, all-caps
adjacencies like "DIY|DIY", "TV|TV") fixed manually
One row, "Food Brokers" at B571, had no
objective in the source data: left as company name only.
Time taken: approx. one minute
Remove unwanted characters from values column
Prompt: And now column G: I am sure you can see and cure the problem without
my guidance: leave all data in column G, however
Read data ... I see values like "†21,217" have a leading dagger character that turns the number into text. Let me scan column G to confirm and find any other non-numeric prefixes:
Find non-numeric prefixes in G
Two prefix markers: † (83 rows) and * (599 rows). You said "leave
all data in column G" — I'll move the markers into a new column H
("Note") so G becomes clean numbers, and the marker info is
preserved.
Strip markers from G into new column H
Done.
G:G is now clean numeric sales (formatted with thousands separator), and
the † / * markers moved into new column
H:H "Sales note" — 83 daggers and 599 asterisks preserved.
Columns to the right shifted accordingly.
Prompt: You did not need to preserve the markers as they are irrelevant to
this file, please remove them
Delete column H
Removed column H.
I appreciate your conservative approach but it is appropriate in this
case
Noted: I'll be less conservative about discarding clearly irrelevant
artifacts going forward.
Time taken: Less than one minute
Duncan Williamson
Between the Sheets: excellence since 2000
8th May 2026

No comments:
Post a Comment