How to correctly parse file names in Bash

Bash file naming conventions are very rich, and it is easy to write a script or one-liner that parses file names incorrectly. Learn to parse file names correctly and, as a result, make sure your scripts work as intended.

The problem of correctly parsing file names in Bash

If you have been using Bash for a while and have been writing scripts in its rich language, you have probably run into some file name parsing problems. Let's take a look at a simple example of what can go wrong:

touch 'a
b'

Here we create a file with a newline in its name (often loosely called a carriage return, or CR) by pressing Enter after the a. Bash file naming conventions are very rich, and while it is rather cool that we can use special characters like this in a file name, let's see how this file fares when we try to perform some actions on it:

ls | xargs rm

This does not work. xargs takes the output of ls (through the | pipe) and passes it to rm, but something went wrong along the way!

What went wrong is that xargs takes the output of ls literally, and the newline inside the file name is seen by xargs as a real item terminator rather than as part of the name to be passed on to rm.
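The failure can be reproduced safely in a throwaway directory. The following is a minimal sketch, assuming a temporary directory from mktemp; the variable names rm_result and survived are made up for illustration:

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One file whose name contains an embedded newline.
touch 'a
b'

# xargs splits the name at the newline, so rm is asked to delete
# "a" and "b", neither of which exists.
if ls | xargs rm 2>/dev/null; then
    rm_result="succeeded"
else
    rm_result="failed"
fi

# The real file is still there afterwards.
if [ -e 'a
b' ]; then
    survived="yes"
else
    survived="no"
fi

echo "rm $rm_result, original file still present: $survived"

# Clean up the throwaway directory.
cd / && rm -rf "$workdir"
```

The quoted name in the existence check is never re-parsed by another tool, which is why it still refers to the right file.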

Let's exemplify this in another way:

ls | xargs -I{} echo '{}|'

It's clear: xargs is processing the input as two separate lines, splitting the original file name in two. Even if we fixed the whitespace problems with some elegant parsing using sed, we would soon run into other problems once we started using other special characters such as spaces, backslashes, quotation marks and more.
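To put a number on the split, one can count how many items xargs believes it received. This is a small sketch under the same assumptions as before (a temporary directory, and the hypothetical variable name items_seen):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Exactly one file, with a newline in its name.
touch 'a
b'

# With -I{}, xargs runs the command once per input item,
# so the number of lines printed equals the number of items it saw.
items_seen=$(ls | xargs -I{} echo 'item' | wc -l | tr -d ' ')

echo "files on disk: 1, items xargs saw: $items_seen"

# Clean up.
cd / && rm -rf "$workdir"
```

One file on disk turns into two items, which is exactly the mismatch that makes the rm call fail.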

touch 'a
b'
touch 'a b'
touch 'a\b'
touch 'a"b'
touch "a'b"
ls

Even if you are a seasoned Bash developer, you may shudder at seeing file names like these, as it would be very complex for most common Bash tools to parse them correctly. You would have to do all sorts of string modifications to make this work. That is, unless you have the secret recipe.
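Newlines are not the only trap. By default xargs also treats quotes as special, so a single unmatched quote in a file name is enough to make it give up. A minimal sketch, assuming a temporary directory and the made-up variable name quote_result:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A file name containing a lone double quote.
touch 'a"b'

# Without -0, xargs tries to interpret the quote and reports an error
# instead of passing the name on to ls.
if ls | xargs ls >/dev/null 2>&1; then
    quote_result="handled"
else
    quote_result="failed"
fi

echo "plain xargs on a name with a quote: $quote_result"

# Clean up.
cd / && rm -rf "$workdir"
```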

Before we dive into that, there is one more thing you should know, which you may run into when parsing ls output. If you use color coding for directory listings, which is enabled by default on Ubuntu, it is easy to run into another set of ls parsing problems.

These are not really related to how the files are named, but rather to how they are presented in the output of ls. The ls output will contain escape codes (non-printing control sequences) representing the color to be used in your terminal.

To avoid running into these, simply pass --color=never as an option to ls:
ls --color=never
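The extra bytes that coloring adds can be made visible by comparing output sizes. This sketch assumes GNU ls (for the --color option) and sets LS_COLORS explicitly so the result does not depend on the terminal type; the directory name demo_dir and the variable names are made up for illustration:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A single directory entry to list.
mkdir demo_dir

# Force colored output and count the bytes ls emits.
colored_bytes=$(LS_COLORS='di=01;34' ls --color=always | wc -c | tr -d ' ')

# The same listing with color disabled.
plain_bytes=$(ls --color=never | wc -c | tr -d ' ')

echo "bytes with color: $colored_bytes, without color: $plain_bytes"

# Clean up.
cd / && rm -rf "$workdir"
```

Any difference between the two numbers is made up of color escape sequences that a parser downstream would have to cope with.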

On Mint 20 (a great operating system derived from Ubuntu), this problem seems to be fixed, though it may still be present in many other or older versions of Ubuntu. I saw this problem as recently as mid-August 2020 on Ubuntu.

Even if you don't use color coding for your directory listings, your script is likely to run on other systems that are not owned or managed by you. In that case, you will also want to use this option to prevent users of that machine from running into the problem described.

Going back to our secret recipe, let's see how we can make sure that special characters in Bash file names cause us no problems. The solution presented avoids ls altogether, which you would do well to avoid in general, so the color coding problems do not apply either.

There are still times when parsing ls output is quick and convenient, but it will always be tricky and likely 'dirty' as soon as special characters are introduced, not to mention insecure (special characters can be used to introduce all kinds of problems).

The secret recipe: NULL termination

The developers of the Bash tools realized this same problem many years ago and have provided us with a solution: NULL termination!

What is NULL termination, you ask? Consider how, in the examples above, the newline (literally pressing Enter) was the main terminating character.

We also saw how you can use special characters such as quotes, whitespace and backslashes in file names, even though they have special functions when it comes to other Bash text parsing and modification tools like sed. Now compare this with the -0 option of xargs, from man xargs:

-0, --null Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.
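To make the man page excerpt concrete, here is a small sketch that feeds xargs two hand-made, NULL terminated items with printf. The item contents (one with a newline, one with a space and a lone quote) and the variable name item_count are made up for illustration:

```shell
# Two items separated by NUL bytes (written as \000):
#   item 1: a, newline, b
#   item 2: c, space, double quote, d
# With -0, only the NUL ends an item, so xargs runs echo exactly twice.
item_count=$(printf 'a\nb\000c "d\000' | xargs -0 -I{} echo 'item' | wc -l | tr -d ' ')

echo "items seen with -0: $item_count"
```

The newline, the space and the unmatched quote all pass through as ordinary characters, as the man page describes.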

And the -print0 option of find, from man find:

-print0 True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.

The True; here means If the option is specified, the following is true;. Also interesting are the two clear warnings given elsewhere in the same manual page:

If you are piping the output of find into another program and there is the faintest possibility that the files which you are searching for might contain a newline, then you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider the use of -print0 instead of -print.
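The difference between -print and -print0 shows up as soon as the results are counted. The following sketch recreates a set of awkward file names in a temporary directory and counts terminators both ways; the variable names newline_count and nul_count are hypothetical:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Five files: newline, space, backslash, double quote, single quote.
touch 'a
b' 'a b' 'a\b' 'a"b' "a'b"

# Counting newline terminators: the embedded newline is counted too.
newline_count=$(find . -name 'a*' -print | wc -l | tr -d ' ')

# Counting NUL terminators: keep only the NUL bytes, then count them.
nul_count=$(find . -name 'a*' -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "newline-terminated count: $newline_count, NUL-terminated count: $nul_count"

# Clean up.
cd / && rm -rf "$workdir"
```

Only the NUL based count matches the number of files actually created, because a NUL byte cannot appear inside a file name.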

These clear warnings remind us that parsing file names in Bash can be, and is, a tricky business. However, with the right options for find, namely -print0, and for xargs, namely -0, all our file names containing special characters can be parsed correctly:

ls
find . -name 'a*' -print0 
find . -name 'a*' -print0 | xargs -0 ls
find . -name 'a*' -print0 | xargs -0 rm

First we check our directory listing. All our file names containing special characters are there. Next we run a simple find ... -print0 to see the output. We observe that the strings are NULL terminated (with the NULL, or \0, character itself not visible).
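Since the NULL character is invisible on the terminal, a byte dump is one way to confirm it is really there. This sketch pipes the find output through od -c, which prints a NUL byte as \0 and a newline as \n; the temporary directory and the variable name dump are assumptions for illustration:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One file with a newline in its name.
touch 'a
b'

# od -c shows every byte, including the otherwise invisible terminator.
dump=$(find . -name 'a*' -print0 | od -c)

printf '%s\n' "$dump"

# Clean up.
cd / && rm -rf "$workdir"
```

In the dump, the \n belongs to the file name and the \0 is the terminator that find added.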

We also note that there is only a single newline in the output, which matches the single newline we entered in the first file name, made up of a followed by Enter followed by b.

Finally, the output does not end with a newline before the $ terminal prompt returns, since the strings were NULL terminated rather than newline terminated. We press Enter at the $ terminal prompt to make things a bit clearer.

Then we add xargs with the -0 option, which enables xargs to handle the NULL terminated input correctly. We see that the input passed to and received by ls looks clean, with no mangling of the text.

Finally we retry our rm command, this time for all the files, including the original one containing the newline that gave us trouble. The rm works perfectly, with no errors or parsing problems. Great!
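The whole sequence can be rehearsed end to end in a temporary directory before trusting it on real data. This is a sketch only; the extra file named keep me and the variable names kept and left are made up to show that files not matching the pattern are untouched:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Awkward names that should be deleted, plus one bystander that should not.
touch 'a
b' 'a b' 'a"b' "a'b" 'keep me'

# NUL-terminated names go from find straight into rm via xargs -0.
find . -name 'a*' -print0 | xargs -0 rm

# Check what is left: only the bystander should remain.
if [ -e 'keep me' ]; then kept="yes"; else kept="no"; fi
left=$(find . -type f -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "bystander kept: $kept, files remaining: $left"

# Clean up.
cd / && rm -rf "$workdir"
```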

Wrapping up

We have seen how important it is, in many cases, to parse and handle file names correctly in Bash. While learning to use find properly is a bit more challenging than simply using ls, the benefits it provides may pay off in the end: greater security and no hassle with special characters.

If you enjoyed this post, you might also want to read How to Rename Files to Numeric File Names in Linux, which features an interesting and somewhat complex find -print0 | xargs -0 statement. Enjoy!