Asubst

Asubst is a command line tool that substitutes text within an input flow of text. It finds occurrences of text that match a pattern (regular expression) and replaces them with a given string.

1. General syntax

Usage: asubst [ { <option> } ] <find_pattern> <replace_string> [ { <file> } ]
or   : asubst -h | --help | -v | --version

Substitutes strings in files.

If no file is provided or if <file> is "-", then substitutes from stdin to stdout.

Warning

Regular expressions are powerful and automatic substitution can be dangerous, so use asubst with caution. Test your pattern with:
echo string | asubst <find_pattern> <replace_string>
and use option -s or -tV if unsure.

See "man 3 pcre" for Perl Compatible Regular Expressions and "man 1 perlre" for Perl Regular Expressions.

2. Options

-a or --ascii : consider that the input flow or file is pure ASCII,
-D <string> or --delimiter=<string> : use a delimiter other than "\n",
-d or --dotall : allow "." to match "\n" (when -D is set),
-e <pattern> or --exclude=<pattern> : skip text matching <pattern>,
-F <file> or --file-list=<file> : read the names of the files to process from <file>,
-f or --file : display the matching file names in grep mode,
-g or --grep : print matching text as grep would do (no substitution),
-i or --ignorecase : match <find_pattern> case-insensitively,
-I or --invertmatch : invert grep matching as "grep -v" would do,
-L or --list : print the names of the files with matching text as "grep -l" would do (no substitution),
-l or --line : display the line numbers in grep mode,
-m <range> or --match=<range> : substitute only the matches within <range>,
-n or --number : print the number of substitutions per file,
-p <dir> or --tmp=<dir> : directory for temporary files,
-q or --quiet : no printout,
-s or --save : make a backup (<file>.asu) of the original file(s),
-t or --test : test mode, substitutions are not performed,
-u or --utf8 : process UTF-8 sequences,
-V or --verbose : print each substitution and its line number,
-x or --noregex : <find_pattern> is considered as string(s),
-- : stop the list of options. If <find_pattern> is "-pat", then use: asubst [ { <option> } ] -- "-pat" <replace_string> [ { <file> } ].

3. Environment

If the LANG environment variable is set to a string containing "UTF-8" or "UTF8" (in any letter case), then asubst considers that the input flow or file is encoded in UTF-8. This setting can be overridden either way by setting the ASUBST_UTF8 environment variable to "Y", "yes", "N" or "no" (in any letter case). Finally, the options -a (--ascii) and -u (--utf8) override both environment settings.
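The precedence described above (LANG, then ASUBST_UTF8, then the command line options) can be sketched in Python. This is only an illustration of the documented rules, not the code of asubst; the function name utf8_enabled is hypothetical, and letting the last of -a and -u win when both are given is an assumption.

```python
def utf8_enabled(env, options=()):
    """Return True if the input would be treated as UTF-8 (illustrative sketch)."""
    # 1. LANG containing "UTF-8" or "UTF8", in any letter case
    lang = env.get("LANG", "").upper()
    enabled = "UTF-8" in lang or "UTF8" in lang

    # 2. ASUBST_UTF8 overrides LANG either way
    override = env.get("ASUBST_UTF8", "").upper()
    if override in ("Y", "YES"):
        enabled = True
    elif override in ("N", "NO"):
        enabled = False

    # 3. Options override the environment (assumption: the last option wins)
    for opt in options:
        if opt in ("-u", "--utf8"):
            enabled = True
        elif opt in ("-a", "--ascii"):
            enabled = False
    return enabled


print(utf8_enabled({"LANG": "en_US.utf8"}))                        # LANG alone
print(utf8_enabled({"LANG": "en_US.UTF-8", "ASUBST_UTF8": "no"}))  # env override
print(utf8_enabled({"LANG": "C"}, ["-u"]))                         # option override
```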

If the asubst_TRACE environment variable is set to "Debug" (in any letter case), then asubst writes some debugging information to stderr. asubst_TRACE_Substit, asubst_TRACE_Search and asubst_TRACE_Replace can also be set to "Debug" for more detailed debugging information.

4. Find pattern

This section describes the find pattern in regex mode. See section Noregex mode for a description of find string in noregex mode.

<find_pattern> ::= <single_regex> | <multiple_regex>

<multiple_regex> ::= { [ <single_regex> ] \n } [ <single_regex> ]

Single regex

If there is no "\n", then the regex can be any regular expression.

If it neither starts with '^' nor ends with '$', then it is applied several times to each input line (each time to the text following the previous substitution).
Examples:

echo "toto" | asubst "t" "x"   ->  "xoxo"
echo "toto" | asubst "t" "ti"  ->  "tiotio"
echo "tito" | asubst "t." "tu" ->  "tutu"

Significant '^' and '$' (not backslashed nor in brackets) can be used to denote the beginning and the end of the input text (thus preventing multiple matches and substitutions).
Examples:

echo "toto" | asubst "^t" "x"   ->  "xoto"
echo "toto" | asubst "o$" "i"   ->  "toti"

Multiple regex

If there are some "\n", then they delimit several regex that will apply to several lines.

The search pattern is split into several regex and \n patterns. Each line of input is split into a sequence of input items: text and newline. Pattern 1 will be compared to item 1. If they match then pattern 2 will be compared to item 2…
Examples:

^\n      -> ^$ \n            (empty line)
r1\nr2\n -> r1$ \n and ^r2$
r1\nr2   -> r1$ \n and ^r2
r1\nr2$  -> r1$ \n ^r2$ and \n
r1$\n    -> r1$ \n

Note

The difference between "r1\n" and "r1$" is that "r1\n" is made of two items: the regex "r1" and the delimiter. With "r1\n", the string matching "r1" AND the following newline will be substituted. With "r1$", only the string matching "r1" will be substituted.
In short, "r1\n" means "r1 and a newline" while "r1$" means "r1, followed by a newline".

Common considerations and concepts

The following specific '\' sequences are also handled by asubst:

"\n", replaced by a line feed (when a specific delimiter is set),
"\s", replaced by a space,
"\t", replaced by a (horizontal) tab,
"\xIJ" where IJ is a hexadecimal number (00 to FF or ff) for a byte code.

"\x00" is forbidden in the find pattern (except in Noregex mode). Beware also that "\xIJ" is replaced before compiling the regex, so it cannot be used to avoid a given character. For example "[a\x2Dx]" is exactly the same as "[a-x]", and "\x0A" is exactly the same as "\n" (it will be used as regex delimiter).

The following shortcuts for character classes can be used in regex mode, within a bracket expression:

"\M" [:alnum:], "\A" [:alpha:], "\B" [:blank:], "\C" [:cntrl:],
"\D" [:digit:], "\G" [:graph:], "\L" [:lower:], "\P" [:print:],
"\T" [:punct:], "\S" [:space:], "\U" [:upper:], "\X" [:xdigit:].

Note	Asubst does NOT check that such a shortcut is within a bracket expression. For example, "\M" is always replaced by "[:alnum:]". Outside a bracket expression this itself becomes a bracket expression (with the characters ':', 'a', 'l', 'n', 'u' and 'm'), which fortunately is invalid for PCRE.

"\RIJ" where IJ is a hexadecimal number will be replaced by the string matching the previous regex number IJ (00 < IJ < CurrentRegex).

Note
For regex numbers, each single regex and each delimiter counts for one.

"\rIJ" where IJ is a hexadecimal number will be replaced by the string matching the substring J of the previous regex number I (0 < I < CurrentRegex), J = 0 for the complete string (so "\R0x" = "\rx0").

Note

* Substrings are numbered in the order of the opening parentheses, left to right.
* Such back reference to a previous match degrades the performance because the regex is re-compiled at each try.
* For back reference within the current regex, use the "\i" notation of the standard regex syntax.
* Beware that "\i" is processed by PCRE and refers to the current regex, while "\rIJ" and "\RIJ" are processed by asubst and refer to a previous regex.

Note	Significant '^' and '$' make sense only at the beginning and at the end of a regex, respectively; otherwise no text will match. In case of several regexes, these markers are implicitly added around the delimiters if needed.

Note

Beware that some regex expressions are supposed to match a newline (for example [:space:] within a bracket expression) but will not match one in asubst, because the input flow/file is first decomposed into lines (except if a specific delimiter is provided), and each line is then compared to a regex.

Note	In "regex" mode, other '\' expressions (like "\i" for intra-regex back reference, or "\b" for word boundary) are passed to PCRE.
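The pre-processing of the find pattern described in this section ("\n", "\s", "\t" and "\xIJ" decoding, then expansion of the character class shortcuts) can be sketched in Python. This is a simplified illustration, not the code of asubst; the function names are hypothetical, and leaving an escaped backslash ("\\") untouched is an assumption. Python's own re module does not support POSIX classes such as [:alnum:], so the expanded result is meant for a PCRE-style engine.

```python
import re

SIMPLE = {"n": "\n", "s": " ", "t": "\t"}
CLASSES = {
    "M": "alnum", "A": "alpha", "B": "blank", "C": "cntrl",
    "D": "digit", "G": "graph", "L": "lower", "P": "print",
    "T": "punct", "S": "space", "U": "upper", "X": "xdigit",
}


def decode_escapes(pattern):
    """Decode the n, s, t and xIJ backslash sequences before regex compilation."""
    def repl(m):
        if m.group(1):      # escaped backslash: kept as is (assumption)
            return m.group(0)
        if m.group(2):      # xIJ: one byte given as two hexadecimal digits
            return chr(int(m.group(2), 16))
        return SIMPLE[m.group(3)]
    return re.sub(r"\\(?:(\\)|x([0-9A-Fa-f]{2})|([nst]))", repl, pattern)


def expand_classes(pattern):
    """Replace each shortcut by its [:class:] name, wherever it appears."""
    def repl(m):
        if m.group(1) == "\\":   # escaped backslash: kept as is (assumption)
            return m.group(0)
        return "[:" + CLASSES[m.group(1)] + ":]"
    return re.sub(r"\\(\\|[" + "".join(CLASSES) + "])", repl, pattern)


print(decode_escapes(r"[a\x2Dx]"))   # same text as the pattern [a-x]
print(expand_classes(r"[\M]+"))      # bracket expression usable by PCRE
```

Because the decoding happens on the text of the pattern, the regex engine never sees the "\xIJ" notation, which is why it cannot be used to protect a special character.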

5. Exclusion pattern

With this option, text that matches the find pattern and also strictly matches the exclusion pattern is not replaced.

Note

Beware that it applies only to the fragment of text that matches the find pattern, not the full input text.
Strictly means that the full text must match the full pattern, e.g. "to" does not strictly match "o" but strictly matches ".o". So, in the exclusion regex, a leading '^' and a trailing '$' are meaningless.

If set, the <exclude_regex> must have the same number of regex as the find pattern.
Exclusion is allowed in the noregex mode, and is then interpreted as a normal string, which is not very useful.

For example echo "toto" | asubst --exclude=toto to ti will replace toto by titi because each matching "to" does not match the exclusion "toto". Same with --exclude=o.
On the other hand, echo "toto" | asubst --exclude=to t. ti and echo "toto" | asubst --exclude=toto toto titi will not substitute.
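The "strict" exclusion rule can be reproduced with a short Python sketch. It is only a model of the behaviour documented here for a single regex: Python's re module stands in for PCRE, the function name substitute is hypothetical, and the replace string is taken literally (no "\RIJ" processing).

```python
import re


def substitute(text, find, replace, exclude=None):
    """Replace each match of `find`, unless the matched fragment
    strictly (i.e. entirely) matches `exclude`."""
    def repl(m):
        fragment = m.group(0)
        # fullmatch: the whole fragment must match the whole exclusion pattern
        if exclude is not None and re.fullmatch(exclude, fragment):
            return fragment          # excluded: left unchanged
        return replace
    return re.sub(find, repl, text)


print(substitute("toto", "to", "ti", exclude="toto"))   # each "to" is replaced
print(substitute("toto", "t.", "ti", exclude="to"))     # each "to" is excluded
```

The exclusion is tested against the matched fragment only, never against the surrounding text, which is why excluding "toto" does not protect the two separate "to" matches.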

6. Match range

The <range> has the format: i-j,k-l… Example: "-4,7-9,15-" means the first 4 matching occurrences, the 7th to the 9th and 15th and following occurrences. Only these occurrences of matching (in each file or flow, after applying exclusion) will be substituted.

Note	"" means none, "-" means all.

Note	Because of the rules for parsing arguments, a <range> starting with '-' requires the long option name. Example: don’t use "-m -4", but "-m 1-4" or "--match=-4".
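The <range> format can be illustrated by a small Python parser. This is a sketch of the syntax described in this section, not the parser of asubst; the name parse_range is hypothetical, and accepting a single number such as "5" as a one-element range is an assumption.

```python
def parse_range(spec):
    """Return a predicate telling whether the n-th match (1-based) is selected."""
    intervals = []
    for part in spec.split(","):
        if not part:
            continue                      # "" selects nothing
        if "-" in part:
            low, high = part.split("-", 1)
            first = int(low) if low else 1          # "-4" starts at 1
            last = int(high) if high else None      # "15-" has no upper bound
        else:
            first = last = int(part)      # assumption: "5" means "5-5"
        intervals.append((first, last))

    def selected(n):
        return any(first <= n and (last is None or n <= last)
                   for first, last in intervals)
    return selected


sel = parse_range("-4,7-9,15-")
print([n for n in range(1, 18) if sel(n)])
```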

7. Delimiter

Normally, asubst reads the input flow or file line by line (separated by '\n'), and searches for matches between

the input sequences of text and '\n'
the sequence of regex (or strings in noregex mode) and '\n' of the find pattern.

However, when an input delimiter is specified (option -D or --delimiter) then asubst reads sequences of text separated by this delimiter (the whole text if delimiter is empty). Asubst also parses the find and the exclusion patterns according to this delimiter.

The delimiter string is not a regular expression but can contain "\n", "\s", "\t" and "\xIJ" (even "\x00").

Setting a delimiter to the empty string can be useful for applying a criterion to the whole file, but the file is then read and processed all at once (so it must not be too large).

Setting a delimiter to a specific string is useful to process a (big) file line by line when the line delimiter of the file is not '\n'.

Note	Special regex characters (e.g. +, ?, ., *…) must not be backslashed in the delimiter, even if they appear in the find pattern.

Examples:

Remove comments (and a possible trailing newline) from an XML file:

asubst -D '' '<!--([^-]|-[^-])*-->\n?' '' file

Tag duplicated words in paragraphs (paragraphs are separated by two \n):

asubst -D '\n\n' -- '\b([\M]+)\b[\S]+\1' '->\R01<-' file

Fix line delimiters of a huge file (see also section Noregex mode):

asubst -D '\x0D' -x '\x0D' '\x0A' file

If a delimiter is specified, then the option -d (--dotall) can also be used to allow '.' in find pattern to match '\n' in the text.

8. Replace string

The following '\' sequences are supported in the replace string:

"\n" will be replaced by a newline.
"\s" will be replaced by a space.
"\t" will be replaced by a (horizontal) tab.
"\xIJ" where IJ is a hexadecimal number (00 to FF or ff) will be replaced by a byte with the corresponding value.
"\RIJ" where IJ is a hexadecimal number will be replaced by the string matching the regex number IJ (00 <= IJ <= NbOfRegex, 00 means the whole string matching all the regexes). For regex numbers, each single regex and each delimiter counts for one.
"\rIJ" where IJ is a hexadecimal number will be replaced by the string matching the substring J of the regex number I (0 < I), J = 0 for the complete string (so "\R0x" = "\rx0").
"\iIJ", "\aIJ", "\oIJ", "\e" and "\f" to replace by the <text> following it, if Jth substring of the Ith regex matches (i.e. \rIJ is not empty).

The logic is a sequence of if… [ { elsif… } ] [ else… ] [ endif ]. Each if or elsif must start with a "\iIJ", possibly immediately followed by one or several "\aIJ" or "\oIJ" evaluated one after the other. For example, "\i11\o12\a13T1\i13T2\eT3\f" means: If sub11 matches or sub12 matches and sub13 matches, then replace by T1, elsif sub13 matches, then replace by T2, else replace by T3.

<text> ends when encountering another "\i", a "\e" or a "\f".
"\K"<shell command>"\k", within which "\RIJ" and "\rIJ" are first replaced, then the command is launched (and must exit with 0), then the whole command directive is replaced by the command output.
"\P"<file path>"\p", within which "\RIJ" and "\rIJ" are first replaced, then the whole directive is replaced by the content of the file.
"\u", "\l", "\m" for starting an UPPER, lower or Mixed case conversion. "\c" for stopping a case conversion. A conversion ends when a new one starts, or on "\c".

Note	Conditions apply first, then replacement, then case conversion.

Note	Substrings are numbered in the order of the opening parentheses, left to right.

Note	"\r0J", "\i0J", "\a0J" and "\o0J" are forbidden.

Note	All these escape sequences of regex mode are supported in noregex mode except "\rIJ", "\iIJ" (thus "\aIJ", "\oIJ", "\e" and "\f") because there is no notion of substring in noregex mode.

Examples:

echo -en "toto" | asubst "t" "\x40"                     ->  "@o@o"
echo -en "toto\ntiti\ntata\n" | asubst ".*\n" "\R01"    ->  "tototititata"
echo -en "\ntoto\ntiti\ntata" | asubst "\n.*" "\R02"    ->  "tototititata"
echo -en "tito" | asubst "(.i)(.o)"  "\r12<->\r11"      ->  "to<->ti"
echo -en "tito" | asubst ".*" "\m\R01\c"                ->  "Tito"
echo -en "toto\ntiti\n" | asubst "to\nti" "\u\R00\c"    ->  "toTO\nTIti\n"
echo -en "toto\ntiti\n" | asubst "t.\nt(.)" "\ut\r31\c" ->  "toTIti\n"
echo -en "toto" | asubst "(.)(.)(..)" "\u\r12\c\r11\r13" ->  "Otto"

9. Grep mode

In grep mode, asubst only looks for the find pattern and does not alter the files.
Grep mode is triggered by any of the following options:

With option -g (--grep), asubst displays the matching text, possibly with the input file name (option file) and possibly with the line number in the input file (options file and line).
An empty replace_string makes asubst display the matching text, but in grep mode without the file option it is possible to provide a non-empty replace_string, so that asubst displays the substitution. This is especially useful if the replace_string contains some "\RIJ" or "\rIJ".
With option -L (--list), asubst displays once each file name where there is at least one match.

In both cases (grep or list) the option -I (--invertmatch) makes asubst show the lines or files that don’t match, instead of those which match. It applies after applying exclusion and match range.

The options grep and list each impose quiet, test and no backup modes.

The options file, list and invertmatch each impose an empty replace string.

The option invertmatch imposes a single regex (i.e. with no delimiter).

Examples:

echo -en "toto" | asubst -gf "toto" ""                  -> toto
echo -en "toto" | asubst -g "toto" "found"              -> found
echo -en "*cs add" | asubst -g "^\*([^ ]+) " "Got \r11" -> Got cs

10. Noregex mode

General principle

In noregex mode asubst considers the find pattern as a sequence of text chunks separated by delimiters.

if a chunk (in the find pattern) is preceded by the delimiter then a matching text must start with the chunk
if a chunk (in the find pattern) is followed by the delimiter then a matching text must end with the chunk
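The two rules above can be sketched in Python for the simplest case: a find string made of a single chunk, checked against one input line that is assumed to be surrounded by delimiters. This is a deliberately reduced model, not the algorithm of asubst; the name count_matches is hypothetical and a non-empty delimiter is assumed.

```python
def count_matches(line, pattern, delim="\n"):
    """Count the matches of a single-chunk noregex pattern in one line."""
    # A leading delimiter anchors the chunk at the start of the line
    at_start = pattern.startswith(delim)
    chunk = pattern[len(delim):] if at_start else pattern
    # A trailing delimiter anchors the chunk at the end of the line
    at_end = chunk.endswith(delim)
    if at_end:
        chunk = chunk[:-len(delim)]

    if at_start and at_end:
        return int(line == chunk)
    if at_start:
        return int(line.startswith(chunk))
    if at_end:
        return int(line.endswith(chunk))
    return line.count(chunk)      # non-overlapping occurrences


for pattern in ("to", "\nto", "to\n", "\nto\n"):
    print(repr(pattern), count_matches("toto", pattern))
```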

Example, in noregex mode:

the find string "to" matches the text "\ntoto\n" twice
the find strings "\nto" and "to\n" each match the text "\ntoto\n" once
the find string "\nto\n" does not match the text "\ntoto\n"

Other considerations on regex and noregex modes

In both modes "\x0A" is the same as "\n", and "\s" and "\t" are interpreted.

In regex mode "\x00" is forbidden in find pattern because it is the C string terminator and it is used as such by the regex library. It is allowed in noregex mode.
All other hexadecimal numbers are supported in both modes.

In noregex mode "\A" to "\X" are not interpreted, like all other regex specific characters (including "^" and "$").

In noregex mode, "\rIJ" with J > 0 is forbidden (because there is no substring). Still "\rI0" can denote the Ith section of text.
Similarly, "\iI0", "\aI0" and "\oI0" are allowed in noregex mode, although they are not very useful.

Optimization in noregex mode

In noregex mode, if the substitution (not grep) has no exclusion and replaces one single character by another character, then an internal optimization allows a significant improvement of performance.

Example: a reasonably small file, of lines separated by "\x0D", can be read all at once and fixed by replacing each "\x0D" character by a line feed.

asubst -x '\x0D' '\x0A' file

Example: If the file is huge, it can be read line by line and fixed with:

asubst -D '\x0D' -x '\x0D' '\x0A' file

11. File list

Option -F <file_list> or --file-list=<file_list> allows specifying a file that contains the list of files to process, instead of specifying these files as arguments. In this case no <file> argument is accepted.

"-" as <file_list> denotes stdin.

The <file_list> must contain one file name per line and empty lines are skipped (so stdin can be specified by "-" in the file list, except if the list is already being read from stdin).
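For scripts that build or check such a list before handing it to asubst, the format can be read as follows. This is a minimal sketch of the rule stated above (one name per line, empty lines skipped); the name read_file_list is hypothetical, and how asubst treats lines containing only spaces is not specified here, so they are kept as names.

```python
def read_file_list(lines):
    """Return the file names found in an iterable of lines."""
    names = []
    for line in lines:
        name = line.rstrip("\n")   # drop the line terminator only
        if name:                   # empty lines are skipped
            names.append(name)
    return names


print(read_file_list(["file1\n", "\n", "-\n", "file2"]))
```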

Examples:

asubst -F list toto titi
echo -e "file1\nfile2" | asubst -F - toto titi

12. Exit code

Asubst exits with 0 if at least one match was found, with 1 if no match was found, and with 2 otherwise (e.g. in case of error). It can also exit with code 3 if stopped (Ctrl-C or SIGTERM) while executing an external command, or with the standard code: 128 + signal_number.
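A calling script can interpret these codes along the following lines. This is a sketch based only on the codes listed above; the name describe_exit and the wording of the messages are hypothetical.

```python
def describe_exit(code):
    """Map an asubst exit code to a short description."""
    if code == 0:
        return "match found"
    if code == 1:
        return "no match"
    if code == 2:
        return "error"
    if code == 3:
        return "stopped while running an external command"
    if code > 128:
        # standard convention: 128 + signal number
        return "terminated by signal %d" % (code - 128)
    return "unknown"


for code in (0, 1, 2, 3, 130):
    print(code, describe_exit(code))
```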

13. Elements of design

This section briefly describes how asubst is decomposed and provides clues on how it behaves.

13.1. Main units

Search_Pattern

This package has two main services:

parse the delimiter as a string, and parse the search and exclude patterns as follows:
- replace all '\' sequences except "\R" and "\r",
- split the pattern according to the delimiter,
- compile the regexes. Prepend a '^' if the regex is preceded by a delim (and does not yet start with a '^'). Same with a trailing '$'.
- store a list of alternated chunks: delimiter(s) and regex, e.g. <regex><delim><delim><regex><delim>. There is a list for the search pattern and another one for the exclusion pattern.
check a string versus one chunk (either of search pattern or of exclusion pattern) as follows:
- a delimiter only matches <delim> and vice versa,
- back references are replaced by the matching (sub)strings and the regex is recompiled,
- the string is checked versus the regex. If it matches, then the string and matching substrings are stored together with the chunk.

Replace_Pattern

This package has two main services:

parse the replace string
generate the replacing string by:
- processing conditions,
- inserting the text of references to matching (sub)strings, the result of shell commands or the content of file,
- converting the result to upper, lower or mixed case.

Substit

This package is in charge of opening, processing and closing one file.

In case of an iterative search pattern (a pattern that can apply several times to a line, i.e. no delimiter, no '^' nor '$'), it:
- reads one line of input,
- iterates on each char of the line: If the current tail of the line matches the search pattern and does not match the exclusion, then it replaces the matching string and jumps to the end of the replaced string.
Otherwise it:
- reads a number of lines corresponding to the number of chunks
- checks each line versus a chunk of the search pattern
- if they all match, then checks each line versus the exclusion pattern
- if no overall match, then it shifts one line and reads a new line, otherwise it replaces the whole matching text and keeps the tail of last line for next check if necessary.

Asubst

This is the main procedure, which:

parses the arguments,
iterates the substitution on all the files,
consolidates the status.

13.2. Multiple-regex find pattern

Decomposition of the find pattern

As seen above, the delimiter (if any) is used to decompose the find pattern into a sequence of single regexes, and by definition there is always at least one delimiter between two regexes.
During the search for a match, asubst checks each input line (that is read according to the delimiter) versus one simple regex of the sequence.

No delimiter in regex

Because of this logic, the delimiter cannot appear in a regex. In other words, a string between two delimiters must be a complete valid regex. This forbids find patterns like:

text\n?text
text(\n|\t)text

Such expressions are valid only if '\n' is not the delimiter.

Back references

Because of this logic there are two kinds of back references:

a back reference to a "local" substring (local to this regex) uses the standard notation "\i" and is quite efficiently handled by the PCRE regex engine.
a back reference to a (sub)string of a previous regex uses the notation "\Rij" or "\rij" and is handled by asubst itself. Asubst replaces the reference by the corresponding (sub)string and (in regex mode) re-compiles the regex, which is not very efficient.

14. Troubleshooting

Asubst is stuck.
→ Maybe an argument is missing or an option is not complete, and asubst is waiting on stdin.
Example: with "asubst -m 5 6 file" asubst considers "5" as the value for m, "6" as the find pattern and "file" as the replace string, so that no file name is provided.
I want to substitute "\f", why do I need to provide "\\\\f"?
→ Because regex imposes to provide "\\f" for matching "\f", ("\f" matches "f") and the shell imposes a "\\" for each "\" transmitted to asubst.

Note that you can pass to the shell '\\f' instead of "\\\\f".

Same considerations apply to "\$" or '$' in order to pass $ to asubst.
ERROR: Cannot create temp file in ".".
→ Asubst cannot make the temporary file, probably because the file system does not support hard links (e.g. samba file system). Try the -p (--tmp) option with a UNIX directory.
The else part of the condition is not processed.
→ The whole replace string is not processed if the input text does not match the find pattern.
Example, with "([A-Z][a-z])|([0-1])" "\i11Letter\i12Digit\eOther\f" the input text "!" remains unchanged. Appending "|." to the search pattern makes it match and be substituted by Other.
Exclusion does not work.
→ The exclusion pattern applies to, only to, and strictly to the whole text matching the find pattern. Examples:
echo toto | asubst -e to t v substitutes to vovo because each matching "t" does not match "to",
echo toto | asubst -e o t. v substitutes to vv because each "to" does not strictly match "o",
echo toto | asubst -e .o t. v does not substitute because each "to" matches ".o".
Optional match at beginning or end of line doesn’t work.
→ PCRE does not match "toto" with for instance "[\B]?$" or "^[\B]?", so an attempt to replace "[\B]?\n[\B]?" by "\n" will not allow detecting "toto \ntiti" nor "toto\n titi", because the 3 patterns are tested one after the other.
Replacing "(.*)[\B]?\n[\B]?(.*)" by "\r11\n\r31" will work, as will a change of the delimiter.