Find Invisible Unicode Characters aka “AI Watermarks”
Clean Up Your Text and Uncover AI Traces: A Tool for Finding and Removing Hidden Unicode
Have you ever faced accusations of using ChatGPT to write your essays? Or maybe you’ve stared at code or data that looked flawless yet stubbornly refused to cooperate? The underlying cause might be something invisible to the naked eye: Unicode characters that don’t render visually, now sometimes referred to as “AI watermarks.”
But there’s a way to find them! Keep reading to learn how.
It’s becoming increasingly apparent that large language models (LLMs) like ChatGPT frequently include these invisible Unicode characters in their output, inadvertently creating spacing and formatting anomalies. This prevalence has led some to dub them “AI watermarks.” Their presence is not a definitive test for AI-generated text, but it can be a strong indicator: finding them in a student’s essay or a colleague’s email makes AI involvement more likely, because these characters are not typically introduced through manual typing.
For example, the text you copied from ChatGPT might appear like this:
Hello world! This is an example text with some invisible characters.
Can you find non-printable spaces between or behind words?
However, if you were able to reveal the underlying Unicode characters, the same text might actually look like this:
Hello[U+00A0]world! This is an example[U+00A0]text with some invisible characters.
Can you find non-printable spaces between or be[U+200B]hind[U+FEFF] words?
Notice the U+00A0, U+200B and U+FEFF elements? These are the invisible Unicode characters that can make a developer’s life difficult or expose the use of AI in a piece of writing.
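A view like this can be produced with a few lines of Python. The sketch below is only an illustration, not the script described later: the function name `reveal` is made up for this example, and it assumes that any character Python reports as non-printable (apart from ordinary line breaks and tabs) should be shown as its code point.

```python
def reveal(text: str) -> str:
    """Return text with each invisible character replaced by a [U+XXXX] marker."""
    shown = []
    for ch in text:
        # str.isprintable() is False for separators other than the ASCII space
        # and for format characters such as U+200B and U+FEFF.
        if ch.isprintable() or ch in "\r\n\t":
            shown.append(ch)
        else:
            shown.append(f"[U+{ord(ch):04X}]")
    return "".join(shown)


sample = "Hello\u00a0world! Can you find be\u200bhind\ufeff words?"
print(reveal(sample))
# Hello[U+00A0]world! Can you find be[U+200B]hind[U+FEFF] words?
```

Accented letters and other visible non-ASCII characters pass through unchanged, because they count as printable.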
This happens because LLMs like ChatGPT seem to have picked up on invisible Unicode characters as a way to create spaces and other formatting in their output, so text from ChatGPT and other AI tools often contains them. Of course, this is still not a bulletproof way of detecting AI-generated text, but it can be a strong hint. If you find these characters in your student’s essay or your boss’s email, chances are the text has been processed with AI in some way, because you would not normally enter them yourself when writing.
While their exact purpose isn’t always clear — it could be an artifact of the generation process, or even a subtle, unintentional fingerprint — their presence can cause significant headaches for developers and data scientists.
Let’s take a closer look at what these characters are and how you can identify and eliminate them.
Understanding These “Invisible” Characters
These are subtle Unicode characters that, while visually undetectable, can significantly disrupt:
- String matching
- Text rendering
- Database searching and indexing
- Machine learning pipelines
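To make the string-matching problem concrete, here is a small sketch with made-up sample strings. The two values look identical when printed, yet Python treats them as different text.

```python
typed = "Hello world"             # regular space, as typed on a keyboard
pasted = "Hello\u00a0world\u200b"  # non-breaking space plus a trailing zero-width space

print(typed == pasted)            # False: the strings are not equal
print(len(typed), len(pasted))    # 11 12: the zero-width space still counts as a character
print("Hello world" in pasted)    # False: a substring search on the visible text fails
print(pasted.split(" "))          # one item only, since there is no regular space to split on
```

The same mismatch is what breaks dictionary lookups, database queries, and deduplication steps that compare text exactly.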
Beyond these technical issues, some tech-savvy individuals are also exploring the presence of these subtle characters as a potential, albeit not foolproof, way to identify AI-generated text.
Common offenders include:
- U+00A0 (non-breaking space): While it looks like a standard space, it behaves differently: it prevents a line break and does not compare equal to a regular space.
- U+200B (zero-width space): A character with no visual width.
- U+200C / U+200D (zero-width non-joiner and zero-width joiner): Used for complex script rendering but can be problematic in plain text.
- U+FEFF (byte order mark, BOM): Often found at the beginning of files and can interfere with processing. Inside text it acts as a zero-width no-break space.
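If you are unsure what a suspicious character is, Python’s standard `unicodedata` module can look up its official name and general category. This is a minimal sketch; the variable names are chosen for the example only.

```python
import unicodedata

offenders = ["\u00a0", "\u200b", "\u200c", "\u200d", "\ufeff"]

# Collect (code point, general category, official Unicode name) for each character.
rows = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    for ch in offenders
]

for code, category, name in rows:
    print(code, category, name)
# U+00A0 Zs NO-BREAK SPACE
# U+200B Cf ZERO WIDTH SPACE
# U+200C Cf ZERO WIDTH NON-JOINER
# U+200D Cf ZERO WIDTH JOINER
# U+FEFF Cf ZERO WIDTH NO-BREAK SPACE
```

Category `Zs` marks space separators and `Cf` marks invisible format characters, which is a handy way to group them when deciding what to replace and what to remove.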
These characters frequently infiltrate your workflows through sources like:
- Copy-pasted content
- AI-generated text
- User-submitted forms
- Exports from PDF or Word documents
- Server responses
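File exports are a typical way for U+FEFF to sneak in. The sketch below uses made-up bytes standing in for a small exported file and shows how the choice of codec decides whether the byte order mark ends up in your data.

```python
# Bytes of a small UTF-8 file that starts with a byte order mark (EF BB BF).
raw = b"\xef\xbb\xbfid,name\n1,Ada\n"

plain = raw.decode("utf-8")         # keeps the BOM as a leading U+FEFF character
stripped = raw.decode("utf-8-sig")  # drops a leading BOM if one is present

print(repr(plain[:3]))     # '\ufeffid'
print(repr(stripped[:3]))  # 'id,'
```

With the plain `utf-8` codec the first column header becomes `"\ufeffid"` instead of `"id"`, which is enough to break a lookup by column name. Passing `encoding="utf-8-sig"` to `open()` avoids this when reading files.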
Here Is How To Find Unicode And Clean Your Text
One option is to find a website that does the job for you. Another is to run a script yourself. I wrote a small Python script that highlights invisible Unicode characters in .txt files and cleans them out.
Here’s What the Script Does
- Highlights hidden characters in the console, providing details such as type, Unicode value, line number, and column number.
- Replaces problematic space-like characters with standard spaces.
- Removes other non-printing characters.
- Generates a cleaned output file named yourfilename.txt.clean.txt
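The cleaning steps above could be sketched roughly as follows. This is not the code from the repository: the function names `clean_text` and `clean_file` are hypothetical, and the rule used here (Unicode category `Zs` becomes a regular space, category `Cf` is dropped) is an assumption about what counts as "space-like" and "non-printing".

```python
import unicodedata
from pathlib import Path


def clean_text(text: str) -> str:
    """Replace space-like characters with ' ' and drop invisible format characters."""
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs" and ch != " ":
            cleaned.append(" ")   # e.g. U+00A0 becomes a regular space
        elif category == "Cf":
            continue              # e.g. U+200B, U+200C, U+200D, U+FEFF are removed
        else:
            cleaned.append(ch)    # everything else, including newlines, is kept
    return "".join(cleaned)


def clean_file(path: str) -> Path:
    """Write a cleaned copy next to the input, named <input>.clean.txt."""
    target = Path(path + ".clean.txt")
    source_text = Path(path).read_text(encoding="utf-8")
    target.write_text(clean_text(source_text), encoding="utf-8")
    return target


print(clean_text("Hello\u00a0world! be\u200bhind\ufeff words"))
# Hello world! behind words
```

One caveat with this rule: removing every `Cf` character also removes zero-width joiners that some scripts and emoji sequences rely on, so it fits plain English text better than multilingual content.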
Here’s How to Use the Script
- Download the script from my GitHub repo (you’ll find it under highlight_and_clean_unicode.py).
- Save it anywhere on your machine.
- Then, run the following line in your terminal:
python3 highlight_and_clean_unicode.py yourfile.txt
And the script will provide output similar to this, highlighting the problematic characters:
Line 1, Col 6: [NBSP] U+00A0
>> Hello world! This is an example text with some invisible characters.
Line 2, Col 48: [ZWSP] U+200B
>> Can you find non-printable spaces between or behind words?
In this output:
- [NBSP] indicates a Non-Breaking Space (U+00A0).
- [ZWSP] indicates a Zero-Width Space (U+200B).
- The >> prefix shows the line of text where the invisible character was found, with a marker inserted to make it visible in the output. The "Col" number indicates the character's position on that line.
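For readers curious how such a report can be built, here is a minimal sketch of the detection step. It is an illustration under assumptions rather than the repository’s code: the `TAGS` table and the function name `find_invisible` are invented for this example, and only the characters listed in the table are flagged.

```python
# Hypothetical short tags for the characters discussed in this post.
TAGS = {
    "\u00a0": "NBSP",
    "\u200b": "ZWSP",
    "\u200c": "ZWNJ",
    "\u200d": "ZWJ",
    "\ufeff": "BOM",
}


def find_invisible(text: str):
    """Yield (line, column, tag, code point) for each flagged character, counting from 1."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in TAGS:
                yield line_no, col_no, TAGS[ch], f"U+{ord(ch):04X}"


sample = (
    "Hello\u00a0world! This is an example text with some invisible characters.\n"
    "Can you find non-printable spaces between or be\u200bhind words?"
)

for line_no, col_no, tag, code in find_invisible(sample):
    print(f"Line {line_no}, Col {col_no}: [{tag}] {code}")
# Line 1, Col 6: [NBSP] U+00A0
# Line 2, Col 48: [ZWSP] U+200B
```

Columns are counted in characters, not bytes, so the numbers line up with what a text editor shows for the same line.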
Open Source
This script is open-source, lightweight, and distributed under the MIT license. You are welcome to fork the repository, contribute improvements, or integrate it into your existing data pipelines.
GitHub link: https://github.com/clemensjarnach/highlight_and_clean_unicode