Asubst is a command line tool that substitutes text within an input flow of text. It finds occurrences of text that match a pattern (regular expression), and replaces these occurrences with a given string.
1. General syntax
Usage: asubst [ { <option> } ] <find_pattern> <replace_string> [ { <file> } ]
or : asubst -h | --help | -v | --version
Substitutes strings in files.
If no file is provided or if <file> is "-", then substitutes from stdin to stdout.
Warning: Regular expressions are powerful and automatic substitution can be dangerous, so use asubst with caution. Test your pattern with:
echo string | asubst <find_pattern> <replace_string>
and use option -s or -tV if unsure.
See "man 3 pcre" for Perl Compatible Regular Expressions and "man 1 perlre" for Perl Regular Expressions.
2. Options
-a or --ascii : consider that the input flow or file is pure ASCII,
-D <string> or --delimiter=<string> : use a delimiter other than \n,
-d or --dotall : allow "." to match "\n", when -D is set,
-e <pattern> or --exclude=<pattern> : skip text matching <pattern>,
-F <file> or --file-list=<file> : read the list of file names from <file>,
-f or --file : display the matching file names in grep mode,
-g or --grep : print matching text as grep would do (no substitution),
-i or --ignorecase : do case insensitive match checking of <find_pattern>,
-I or --invertmatch : invert grep matching as "grep -v" would do,
-L or --list : print the files matching text as "grep -l" would do (no substitution),
-l or --line : display the line numbers in grep mode,
-m <range> or --match=<range> : substitute only the matches within <range>,
-n or --number : print number of substitutions per file,
-p <dir> or --tmp=<dir> : directory for temporary files,
-q or --quiet : no printout,
-s or --save : make a backup (<file>.asu) of the original file(s),
-t or --test : test mode, substitutions are not performed,
-u or --utf8 : process UTF-8 sequences,
-V or --verbose : print each substitution and its line number,
-x or --noregex : <find_pattern> is considered as string(s),
-- : stop the list of options. If <find_pattern> is "-pat", then use:
asubst [ { <option> } ] -- "-pat" <replace_string> [ { <file> } ]
3. Environment
If the LANG environment variable is set to a string containing "UTF-8" or "UTF8" (in any letter case), then asubst considers that the input flow or file is encoded in UTF-8. This setting can be overridden either way by setting the ASUBST_UTF8 environment variable to "Y", "yes", "N" or "no" (in any letter case). Finally, it can be overridden either way by the options -a (--ascii) or -u (--utf8).
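The precedence described above (LANG first, then ASUBST_UTF8, then the command line options) can be sketched as follows. This is a minimal Python model of the documented rule, not asubst's actual code: the function name utf8_mode is hypothetical, and letting the last of -a / -u win when both are given is an assumption.

```python
def utf8_mode(env, options=()):
    """Return True if the input should be treated as UTF-8.

    `env` is a mapping of environment variables and `options` a sequence of
    command line options. Both names are illustrative, not asubst's API.
    """
    # 1. Default from LANG: any casing of "UTF-8" or "UTF8".
    lang = env.get("LANG", "").upper()
    utf8 = "UTF-8" in lang or "UTF8" in lang

    # 2. ASUBST_UTF8 overrides LANG either way.
    override = env.get("ASUBST_UTF8", "").upper()
    if override in ("Y", "YES"):
        utf8 = True
    elif override in ("N", "NO"):
        utf8 = False

    # 3. Options override the environment (assumption: the last one wins).
    for option in options:
        if option in ("-u", "--utf8"):
            utf8 = True
        elif option in ("-a", "--ascii"):
            utf8 = False
    return utf8


print(utf8_mode({"LANG": "en_US.utf8"}))                         # True
print(utf8_mode({"LANG": "en_US.utf8", "ASUBST_UTF8": "no"}))    # False
print(utf8_mode({"LANG": "C"}, ["--utf8"]))                      # True
```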
If the asubst_TRACE environment variable is set to "Debug" (in any letter case), then asubst writes some debugging information to stderr. asubst_TRACE_Substit, asubst_TRACE_Search and asubst_TRACE_Replace can also be set to "Debug" for more detailed debugging information.
4. Find pattern
This section describes the find pattern in regex mode. See section Noregex mode for a description of find string in noregex mode.
<find_pattern> ::= <single_regex> | <multiple_regex>
<multiple_regex> ::= { [ <single_regex> ] \n } [ <single_regex> ]
If there is no "\n", then the regex can be any regular expression.
If it neither starts with '^' nor ends with '$', then it is applied several
times to each input line (each time to the text following the previous
substitution).
Examples:
echo "toto" | asubst "t" "x" -> "xoxo"
echo "toto" | asubst "t" "ti" -> "tiotio"
echo "tito" | asubst "t." "tu" -> "tutu"
Significant '^' and '$' (not backslashed nor in brackets) can be used to
denote the beginning and the end of the input text (thus preventing multiple
matches and substitutions).
Examples:
echo "toto" | asubst "^t" "x" -> "xoto"
echo "toto" | asubst "o$" "i" -> "toti"
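As a rough illustration of the examples above, Python's re.sub also substitutes every non-overlapping match from left to right, and a '^' or '$' restricts the substitution to a single place. Python's re module is not PCRE, but these simple patterns behave the same in both. The helper name substitute_line is hypothetical.

```python
import re


def substitute_line(line, find, replace):
    """Replace every non-overlapping match of `find` in `line`, left to right."""
    # A function is used as replacement so that `replace` is taken literally
    # (re.sub would otherwise interpret backslashes in it).
    return re.sub(find, lambda match: replace, line)


print(substitute_line("toto", "t", "x"))    # xoxo
print(substitute_line("toto", "t", "ti"))   # tiotio
print(substitute_line("tito", "t.", "tu"))  # tutu
print(substitute_line("toto", "^t", "x"))   # xoto
print(substitute_line("toto", "o$", "i"))   # toti
```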
If there are some "\n", then they delimit several regexes that apply to several lines.
The search pattern is split into a sequence of regexes and "\n" patterns. Each line of
input is split into a sequence of input items: text and newline. Pattern 1
is compared to item 1. If they match, then pattern 2 is compared to
item 2, and so on.
Examples:
^\n -> ^$ \n (empty line)
r1\nr2\n -> r1$ \n ^r2$ and \n
r1\nr2 -> r1$ \n and ^r2
r1\nr2$ -> r1$ \n and ^r2$
r1$\n -> r1$ \n
Note: The difference between "r1\n" and "r1$" is that "r1\n" is two regexes. With "r1\n", the string matching "r1" AND the following newline will be substituted. With "r1$", only the string matching "r1" will be substituted. In short, "r1\n" means "r1 and a newline" while "r1$" means "r1, followed by a newline".
The following specific '\' sequences are also handled by asubst:
- "\n", replaced by a line feed (when a specific delimiter is set),
- "\s", replaced by a space,
- "\t", replaced by a (horizontal) tab,
- "\xIJ" where IJ is a hexadecimal number (00 to FF or ff) for a byte code.
  "\x00" is forbidden in the find pattern (except in noregex mode). Beware also that "\xIJ" is replaced before compiling the regex, so it cannot be used to escape a special character. For example "[a\x2Dx]" is exactly the same as "[a-x]", and "\x0A" is exactly the same as "\n" (it will be used as regex delimiter).
- The following shortcuts for character classes, to be used in regex mode and within a bracket expression:
  "\M" [:alnum:], "\A" [:alpha:], "\B" [:blank:], "\C" [:cntrl:], "\D" [:digit:], "\G" [:graph:], "\L" [:lower:], "\P" [:print:], "\T" [:punct:], "\S" [:space:], "\U" [:upper:], "\X" [:xdigit:].
  Note: Asubst does NOT check that such a shortcut is within a bracket expression. For example, "\M" is always replaced by "[:alnum:]". Outside a bracket expression this becomes itself a bracket expression (with the ':' 'a' 'l' 'n' 'u' 'm' and ':' characters), which fortunately is invalid for PCRE.
- "\RIJ" where IJ is a hexadecimal number will be replaced by the string matching the previous regex number IJ (00 < IJ < CurrentRegex).
  Note: For regex numbers, each single regex and each delimiter counts for one.
- "\rIJ" where IJ is a hexadecimal number will be replaced by the string matching the substring J of the previous regex number I (0 < I < CurrentRegex), J = 0 for the complete string (so "\R0x" = "\rx0").
  Notes:
  * Substrings are numbered in the order of the opening parentheses, left to right.
  * Such a back reference to a previous match degrades the performance because the regex is re-compiled at each try.
  * For a back reference within the current regex, use the "\i" notation of the standard regex syntax.
  * Beware that "\i" is processed by PCRE and refers to the current regex, while "\rIJ" and "\RIJ" are processed by asubst and refer to a previous regex.
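The fact that these sequences are expanded before the regex is compiled can be illustrated with a small Python sketch. It only models the textual replacement of "\xIJ", "\s", "\t" and the character class shortcuts, and passes every other '\' sequence through unchanged. The function name preprocess is hypothetical, bytes are approximated by characters, and "\n", "\RIJ" and "\rIJ" are deliberately left untouched.

```python
import re

# Shortcuts for character classes, as listed above.
CLASSES = {
    "M": "[:alnum:]", "A": "[:alpha:]", "B": "[:blank:]", "C": "[:cntrl:]",
    "D": "[:digit:]", "G": "[:graph:]", "L": "[:lower:]", "P": "[:print:]",
    "T": "[:punct:]", "S": "[:space:]", "U": "[:upper:]", "X": "[:xdigit:]",
}

# A backslash followed either by xIJ (two hexadecimal digits) or by any character.
ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def preprocess(pattern):
    """Expand the sequences handled before the regex is compiled (sketch)."""
    def expand(match):
        body = match.group(1)
        if len(body) == 3:                      # "\xIJ"
            code = int(body[1:], 16)
            if code == 0:
                raise ValueError(r"\x00 is forbidden in regex mode")
            return chr(code)
        if body == "s":
            return " "
        if body == "t":
            return "\t"
        if body in CLASSES:
            return CLASSES[body]
        return match.group(0)                   # left for the regex engine
    return ESCAPE.sub(expand, pattern)


print(preprocess(r"[a\x2Dx]"))   # [a-x]
print(preprocess(r"[\M]+"))      # [[:alnum:]]+
print(preprocess(r"\bword\b"))   # \bword\b
```

Because the expansion is purely textual, "\x2D" ends up as a plain '-' inside the bracket expression and defines a range, exactly as the text warns.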
Note: Significant '^' and '$' make sense only respectively at the beginning and at the end of a regex, otherwise no text will match. In case of several regexes these markers are implicitly added around the delimiters if needed.
Note: Beware that some regex expressions are supposed to match a "new line" (example [:blank:] within a bracket expression) but will not match in asubst, because the input flow/file is decomposed into lines first (except if a specific delimiter is provided), then each line is compared to a regex expression.
Note: In "regex" mode, other '\' expressions (like "\i" for intra-regex back reference, or "\b" for word boundary) are passed to PCRE.
5. Exclusion pattern
With this option, text that matches the find pattern and also strictly matches the exclusion pattern is not replaced.
Note: Beware that the exclusion applies only to the fragment of text that matches the find pattern, not to the full input text. "Strictly" means that the full fragment must match the full pattern, e.g. "to" does not strictly match "o" but strictly matches ".o". So, in the exclusion regex, a leading '^' and a trailing '$' are meaningless.
If set, the exclusion pattern must have the same number of regexes as the find pattern.
Exclusion is allowed in noregex mode, and is then interpreted as a normal string, which is not very useful.
For example, echo "toto" | asubst --exclude=toto to ti will replace toto by titi because each matching "to" does not match the exclusion "toto". Same with --exclude=o.
On the other hand, echo "toto" | asubst --exclude=to t. ti and echo "toto" | asubst --exclude=toto toto titi will not substitute.
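The strict matching rule can be modeled in Python with re.fullmatch applied to each matched fragment. The sketch below follows the character by character iteration that the design section describes for iterative patterns; it handles a single regex with a literal replace string only, and the function name is hypothetical. It is an illustration of the documented rule, not asubst's implementation.

```python
import re


def substitute_with_exclusion(line, find, exclude, replace):
    """Replace fragments matching `find` unless they strictly match `exclude`."""
    find_re = re.compile(find)
    exclude_re = re.compile(exclude)
    out = []
    pos = 0
    while pos < len(line):
        match = find_re.match(line, pos)        # does the current tail match?
        if (match is not None and match.end() > pos
                and exclude_re.fullmatch(match.group(0)) is None):
            out.append(replace)                 # substitute and jump after it
            pos = match.end()
        else:
            out.append(line[pos])               # keep this char, try the next one
            pos += 1
    return "".join(out)


print(substitute_with_exclusion("toto", "to", "toto", "ti"))      # titi
print(substitute_with_exclusion("toto", "to", "o", "ti"))         # titi
print(substitute_with_exclusion("toto", "t.", "to", "ti"))        # toto
print(substitute_with_exclusion("toto", "toto", "toto", "titi"))  # toto
```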
6. Match range
The <range> has the format: i-j,k-l… Example: "-4,7-9,15-" means the first 4 matching occurrences, the 7th to the 9th and 15th and following occurrences. Only these occurrences of matching (in each file or flow, after applying exclusion) will be substituted.
Note: "" means none, "-" means all.
Note: Because of the rules for parsing arguments, a <range> starting with '-' requires the long option name. Example: don’t use "-m -4", but "-m 1-4" or "--match=-4".
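To make the range format concrete, here is a small Python sketch that parses a <range> and tells whether the n-th match is selected. The function names are hypothetical, and accepting a single number such as "5" as the one-element interval 5-5 is an assumption.

```python
def parse_range(spec):
    """Parse "i-j,k-l..." into a list of (low, high); high is None if open-ended."""
    intervals = []
    if spec == "":
        return intervals                      # "" means none
    for part in spec.split(","):
        low, dash, high = part.partition("-")
        if not dash:                          # single number (assumed allowed)
            intervals.append((int(low), int(low)))
        else:
            intervals.append((int(low) if low else 1,
                              int(high) if high else None))
    return intervals


def is_selected(occurrence, intervals):
    """Return True if the occurrence number (starting at 1) is in the range."""
    return any(low <= occurrence and (high is None or occurrence <= high)
               for low, high in intervals)


intervals = parse_range("-4,7-9,15-")
print([n for n in range(1, 18) if is_selected(n, intervals)])
# [1, 2, 3, 4, 7, 8, 9, 15, 16, 17]
```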
7. Delimiter
Normally, asubst reads the input flow or file line by line (separated by '\n'), and searches for matches between:
- the input sequences of text and '\n',
- the sequence of regexes (or strings in noregex mode) and '\n' of the find pattern.
However, when an input delimiter is specified (option -D or --delimiter), asubst reads sequences of text separated by this delimiter (the whole text if the delimiter is empty). Asubst also parses the find and the exclusion patterns according to this delimiter.
The delimiter string is not a regular expression but can contain "\n", "\s", "\t" and "\xIJ" (even "\x00").
Setting the delimiter to the empty string can be useful for applying a criterion to the whole file, but the file is then read and processed all at once (so it must not be too large).
Setting a delimiter to a specific string is useful to process a (big) file line by line when the line delimiter of the file is not '\n'.
Note: Special regex characters (e.g. +, ?, ., *…) must not be backslashed in the delimiter, even if they appear in the find pattern.
Examples:
Remove comments (and a possible trailing newline) from an XML file:
asubst -D '' '<!--([^-]|-[^-])*-->\n?' '' file
Tag duplicated words in paragraphs (paragraphs are separated by two \n):
asubst -D '\n\n' -- '\b([\M]+)\b[\S]+\1' '->\R01<-' file
Fix line delimiters of a huge file (see also section Noregex mode):
asubst -D '\x0D' -x '\x0D' '\x0A' file
If a delimiter is specified, then the option -d (--dotall) can also be used to allow '.' in find pattern to match '\n' in the text.
8. Replace string
The following '\' sequences are supported in the replace string:
- "\n" will be replaced by a newline.
- "\s" will be replaced by a space.
- "\t" will be replaced by a (horizontal) tab.
- "\xIJ" where IJ is a hexadecimal number (00 to FF or ff) will be replaced by a byte with the corresponding value.
- "\RIJ" where IJ is a hexadecimal number will be replaced by the string matching the regex number IJ (00 <= IJ <= NbOfRegex, 00 means the whole string matching all the regexes). For regex numbers, each single regex and each delimiter counts for one.
- "\rIJ" where IJ is a hexadecimal number will be replaced by the string matching the substring J of the regex number I (0 < I), J = 0 for the complete string (so "\R0x" = "\rx0").
- "\iIJ", "\aIJ", "\oIJ", "\e" and "\f" to replace by the <text> following it, if the Jth substring of the Ith regex matches (i.e. \rIJ is not empty).
  The logic is a sequence of if… [ { elsif… } ] [ else… ] [ endif ]. Each if or elsif must start with a "\iIJ", possibly immediately followed by one or several "\aIJ" or "\oIJ" evaluated one after the other. For example, "\i11\o12\a13T1\i13T2\eT3\f" means: If sub11 matches or sub12 matches and sub13 matches, then replace by T1, elsif sub13 matches, then replace by T2, else replace by T3.
  <text> ends when encountering another "\i", a "\e" or a "\f".
- "\K"<shell command>"\k", within which "\RIJ" and "\rIJ" are first replaced, then the command is launched (and must exit with 0), then the whole command directive is replaced by the command output.
- "\P"<file path>"\p", within which "\RIJ" and "\rIJ" are first replaced, then the whole directive is replaced by the content of the file.
- "\u", "\l", "\m" for starting an UPPER, lower or Mixed case conversion. "\c" for stopping a case conversion. A conversion ends when a new one starts, or on "\c".
Note: Conditions apply first, then replacement, then case conversion.
Note: Substrings are numbered in the order of the opening parentheses, left to right.
Note: "\r0J", "\i0J", "\a0J" and "\o0J" are forbidden.
Note: All these escape sequences of regex mode are supported in noregex mode except "\rIJ", "\iIJ" (thus "\aIJ", "\oIJ", "\e" and "\f") because there is no notion of substring in noregex mode.
Examples:
echo -en "toto" | asubst "t" "\x40" -> "@o@o"
echo -en "toto\ntiti\ntata\n" | asubst ".*\n" "\R01" -> "tototititata"
echo -en "\ntoto\ntiti\ntata" | asubst "\n.*" "\R02" -> "tototititata"
echo -en "tito" | asubst "(.i)(.o)" "\r12<->\r11" -> "to<->ti"
echo -en "tito" | asubst ".*" "\m\R01\c" -> "Tito"
echo -en "toto\ntiti\n" | asubst "to\nti" "\u\R00\c" -> "toTO\nTIti\n"
echo -en "toto\ntiti\n" | asubst "t.\nt(.)" "\ut\r31\c" -> "toTIti\n"
echo -en "toto" | asubst "(.)(.)(..)" "\u\r12\c\r11" -> "Otto"
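The numbering of "\RIJ" and "\rIJ" can be illustrated for the simplest case, a find pattern made of one single regex (regex number 1). In the Python sketch below, "\R00" and "\R01" both stand for the whole match and "\r1J" for its Jth substring. The function names are hypothetical, and conditions, case conversions, commands and multiple regexes are not modeled.

```python
import re

# \R00 or \R01 (whole match), or \r1J (substring J of regex number 1).
REFERENCE = re.compile(r"\\R0[01]|\\r1([0-9A-Fa-f])")


def expand_references(replace_string, match):
    """Expand \\R00, \\R01 and \\r1J in `replace_string` for one match."""
    def expand(reference):
        if reference.group(1) is None:          # \R00 or \R01
            return match.group(0)
        substring = int(reference.group(1), 16)
        return match.group(substring) or ""     # unmatched substring -> empty
    return REFERENCE.sub(expand, replace_string)


def substitute(text, find, replace_string):
    """Substitute each match of `find`, expanding the references per match."""
    return re.sub(find, lambda m: expand_references(replace_string, m), text)


print(substitute("tito", "(.i)(.o)", r"\r12<->\r11"))   # to<->ti
print(substitute("toto", "t(.)", r"[\R01:\r11]"))       # [to:o][to:o]
```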
9. Grep mode
In grep mode, asubst only looks for the find pattern and does not alter the
files.
Grep mode is triggered by any of the following options:
- With option -g (--grep), asubst displays the matching text, possibly with the input file name (option file) and possibly with the line number in the input file (options file and line).
  An empty replace_string leads to display the matching text, but in grep mode without file mode it is possible to provide a non empty replace_string, so that asubst displays the substitution. This is especially useful if the replace_string contains some "\RIJ" or "\rIJ".
- With option -L (--list), asubst displays once each file name where there is at least one match.
In both cases (grep or list) the option -I (--invertmatch) makes asubst show the lines or files that don’t match, instead of those which match. It applies after applying exclusion and match range.
The options grep and list each impose quiet, test and no backup modes.
The options file, list and invertmatch each impose an empty replace string.
The option invertmatch imposes a single regex (i.e. with no delimiter).
Examples:
echo -en "toto" | asubst -gf "toto" "" -> toto
echo -en "toto" | asubst -g "toto" "found" -> found
echo -en "*cs add" | asubst -g "^\*([^ ]+) " "Got \r11" -> Got cs
10. Noregex mode
In noregex mode asubst considers the find pattern as a sequence of text chunks separated by delimiters:
- if a chunk (in the find pattern) is preceded by the delimiter, then a matching text must start with the chunk,
- if a chunk (in the find pattern) is followed by the delimiter, then a matching text must end with the chunk.
Example, in noregex mode, with the input text "\ntoto\n":
the find pattern "to" matches twice,
the find patterns "\nto" and "to\n" each match once,
the find pattern "\nto\n" does not match.
In both modes "\x0A" is the same as "\n", and "\s" and "\t" are interpreted.
In regex mode "\x00" is forbidden in find pattern because it is the C string
terminator and it is used as such by the regex library. It is allowed in
noregex mode.
All other hexadecimal numbers are supported in both modes.
In noregex mode "\A" to "\X" are not interpreted, like all other regex specific characters (including "^" and "$").
In noregex mode, "\rIJ" with J > 0 is forbidden (because there is no
substring). Still "\rI0" can denote the Ith section of text.
Similarly, "\iI0", "\aI0" and "\oI0" are allowed in noregex mode, although they are
not very useful.
In noregex mode, if the substitution (not grep) has no exclusion and replaces one single character by another character, then an internal optimization allows a significant improvement of performance.
Example: a reasonably small file, of lines separated by "\x0D", can be read all at once and fixed by replacing each "\x0D" character by a line feed.
asubst -x '\x0D' '\x0A' file
Example: If the file is huge, it can be read line by line and fixed with:
asubst -D '\x0D' -x '\x0D' '\x0A' file
11. File list
Option -F <file_list> or --file-list=<file_list> allows specifying a file that contains the list of files to process, instead of specifying these files as arguments. In this case no <file> argument is accepted.
"-" as <file_list> denotes stdin.
The <file_list> must contain one file name per line and empty lines are skipped (so stdin can be specified by "-" in the file list, except if the list is already being read from stdin).
Examples:
asubst -F list toto titi
echo -e "file1\nfile2" | asubst -F - toto titi
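The file list format is simple enough to be summed up in a few lines of Python. This sketch only shows the documented parsing rule (one name per line, empty lines skipped); the function name is hypothetical, and whether asubst trims surrounding blanks is unknown, so names are kept as they are.

```python
import io


def read_file_list(stream):
    """Return the file names listed in `stream`, skipping empty lines."""
    names = []
    for line in stream:
        name = line.rstrip("\n")
        if name:                     # empty lines are skipped
            names.append(name)
    return names


# A "-" entry stands for stdin as a file to process.
listing = io.StringIO("file1\n\nfile2\n-\n")
print(read_file_list(listing))       # ['file1', 'file2', '-']
```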
12. Exit code
Asubst exits with 0 if at least one match was found, with 1 if no match was found, and with 2 otherwise (e.g. in case of error). It can also exit with code 3 if stopped (Ctrl-C or SIGTERM) while executing an external command, or with the standard code: 128 + signal_number.
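A calling script may want to turn these exit codes into messages. The Python sketch below maps a code to a description according to the rules above; the function name and the wording of the messages are illustrative only.

```python
import signal


def describe_exit(code):
    """Return a short description of an asubst exit code (sketch)."""
    if code == 0:
        return "match found"
    if code == 1:
        return "no match found"
    if code == 2:
        return "error"
    if code == 3:
        return "stopped while executing an external command"
    if code > 128:
        number = code - 128
        try:
            return "terminated by " + signal.Signals(number).name
        except ValueError:           # not a signal known on this platform
            return "terminated by signal %d" % number
    return "unexpected exit code %d" % code


print(describe_exit(0))                       # match found
print(describe_exit(128 + signal.SIGTERM))    # terminated by SIGTERM
```

The code returned by subprocess.run(...).returncode could be passed to such a function. Note that Python reports a child killed by a signal as a negative return code, so the 128 + signal_number form is what a shell would show.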
13. Elements of design
This section briefly describes how asubst is decomposed and provides clues on how it behaves.
13.1. Main units
This package has two main services:
- parse the delimiter as a string, and parse the search and exclude patterns as follows:
  - replace all '\' sequences except "\R" and "\r",
  - split the pattern according to the delimiter,
  - compile the regexes. Prepend a '^' if the regex is preceded by a delim (and does not yet start with a '^'). Same with a trailing '$'.
  - store a list of alternated chunks: delimiter(s) and regex, e.g. <regex><delim><delim><regex><delim>. There is a list for the search pattern and another one for the exclusion pattern.
- check a string versus one chunk (either of search pattern or of exclusion pattern) as follows:
  - a delimiter only matches <delim> and vice versa,
  - back references are replaced by the matching (sub)strings and the regex is recompiled,
  - the string is checked versus the regex. If it matches, then the string and matching substrings are stored together with the chunk.
This package has two main services:
- parse the replace string,
- generate the replacing string by:
  - processing conditions,
  - inserting the text of references to matching (sub)strings, the result of shell commands or the content of file,
  - converting the result to upper, lower or mixed case.
This package is in charge of opening, processing and closing one file.
- In case of iterative search pattern (a pattern that can apply several times to a line, i.e. no delimiter, no '^' nor '$'), it:
  - reads one line of input,
  - iterates on each char of the line: If the current tail of the line matches the search pattern and does not match the exclusion, then it replaces the matching string and jumps to the end of the replaced string.
- Otherwise it:
  - reads a number of lines corresponding to the number of chunks,
  - checks each line versus a chunk of the search pattern,
  - if they all match, then checks each line versus the exclusion pattern,
  - if no overall match, then it shifts one line and reads a new line, otherwise it replaces the whole matching text and keeps the tail of last line for next check if necessary.
This is the main procedure, which:
- parses the arguments,
- iterates the substitution on all the files,
- consolidates the status.
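The non-iterative case described above (a window of lines checked against a sequence of chunks) can be sketched in Python under strong simplifying assumptions: the delimiter is "\n", the find pattern is given as a list of at least two regexes separated by exactly one delimiter each, there is no exclusion pattern, and the replacement is a literal string. All names are hypothetical and this is only a model of the described logic, not asubst's code.

```python
import re


def match_window(window, regexes):
    """Return one match per line if every line matches its regex, else None."""
    if len(window) < len(regexes):
        return None
    matches = []
    last = len(regexes) - 1
    for index, (line, regex) in enumerate(zip(window, regexes)):
        # Implicit anchors around the delimiters (simplified: an escaped
        # "\$" at the end of a regex is not recognized here).
        if index > 0 and not regex.startswith("^"):
            regex = "^" + regex
        if index < last and not regex.endswith("$"):
            regex = regex + "$"
        match = re.search(regex, line)
        if match is None:
            return None
        matches.append(match)
    return matches


def substitute_multiline(text, regexes, replacement):
    """Substitute the text matching a sequence of two or more line regexes."""
    if len(regexes) < 2:
        raise ValueError("this sketch needs at least two regexes")
    lines = text.split("\n")
    count = len(regexes)
    done = []        # completed output lines
    prefix = ""      # already substituted start of the current line
    i = 0
    while i < len(lines):
        matches = match_window(lines[i:i + count], regexes)
        if matches is None:
            done.append(prefix + lines[i])      # shift by one line
            prefix = ""
            i += 1
        else:
            prefix += lines[i][:matches[0].start()] + replacement
            i += count - 1
            # Keep the tail of the last line for the next check.
            lines[i] = lines[i][matches[-1].end():]
    return "\n".join(done)


print(repr(substitute_multiline("toto\ntiti\n", ["to", "ti"], "TO\nTI")))
# 'toTO\nTIti\n'
```

The printed result corresponds to the "to\nti" example of the Replace string section, where the upper case conversion is applied to the whole matching text.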
13.2. Multiple-regex find pattern
As seen above, the delimiter (if any) is used to decompose the find pattern
into a sequence of single regexes, and by definition there is always at least
one delimiter between two regexes.
During the search for a match, asubst checks each input line (that is read
according to the delimiter) versus one simple regex of the sequence.
Because of this logic, the delimiter cannot appear in a regex. In other words, a string between two delimiters must be a complete valid regex. This forbids find patterns such as:
text\n?text
text(\n|\t)text
Such expressions are valid only if '\n' is not the delimiter.
Because of this logic there are two kinds of back references:
- a back reference to a "local" substring (local to this regex) uses the standard notation "\i" and is quite efficiently handled by the PCRE regex engine,
- a back reference to a (sub)string of a previous regex uses the notation "\RIJ" or "\rIJ" and is handled by asubst itself. Asubst replaces the reference by the corresponding (sub)string and (in regex mode) recompiles the regex, which is not very efficient.
14. Troubleshooting
- Asubst is stuck.
  → Maybe an argument is missing or an option is not complete, and asubst is waiting on stdin.
  Example: with "asubst -m 5 6 file" asubst considers "5" as the value for -m, "6" as the find pattern and "file" as the replace string, so that no file name is provided.
- I want to substitute "\f", why do I need to provide "\\\\f"?
  → Because the regex syntax requires "\\f" for matching "\f" ("\f" matches "f") and the shell requires a "\\" for each "\" transmitted to asubst. Note that you can pass to the shell '\\f' instead of "\\\\f".
  Same considerations apply to "\$" or '$' in order to pass $ to asubst.
- ERROR: Cannot create temp file in ".".
  → Asubst cannot make the temporary file, probably because the file system does not support hard links (e.g. samba file system). Try the -p (--tmp) option with a UNIX directory.
- The else part of the condition is not processed.
  → The whole replace string is not processed if the input text does not match the find pattern.
  Example, with "([A-Z][a-z])|([0-1])" "\i11Letter\i12Digit\eOther\f" the input text "!" remains unchanged. Appending "|." to the search pattern makes it match and be substituted by Other.
- Exclusion does not work.
  → The exclusion pattern applies to, only to, and strictly to the whole text matching the find pattern. Examples:
  echo toto | asubst -e to t v
  substitutes to vovo because each matching "t" does not match "to",
  echo toto | asubst -e o t. v
  substitutes to vv because each "to" does not strictly match "o",
  echo toto | asubst -e .o t. v
  does not substitute because each "to" matches ".o".
- Optional match at beginning or end of line doesn’t work.
  → PCRE does not match "toto" with for instance "[\B]?$" or "^[\B]?", so an attempt to replace "[\B]?\n[\B]?" by "\n" will not allow detecting "toto \ntiti" nor "toto\n titi", because the 3 patterns are tested one after the other.
  Replacing "(.*)[\B]?\n[\B]?(.*)" by "\r11\n\r31" will work, as well as a change of the delimiter.
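The shell quoting issue of the "\\\\f" item above can be checked without running asubst. Python's shlex module approximates POSIX shell word splitting, so it shows which argument the program would receive for each quoting style. The arguments X and file are placeholders, and shlex is only an approximation of a real shell (it does not treat "\$" inside double quotes the way a shell does).

```python
import shlex

# What a POSIX shell would hand over for two equivalent quotings.
double_quoted = shlex.split(r'asubst "\\\\f" X file')
single_quoted = shlex.split(r"asubst '\\f' X file")

print(double_quoted[1])                   # \\f
print(single_quoted[1])                   # \\f
print(double_quoted == single_quoted)     # True
```

In both cases the find pattern that reaches the program is a double backslash followed by "f", which the regex engine then reads as a literal backslash followed by "f".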