
Unix is an alliance of loosely structured text files bound together and governed by scripts. Unix is the United Confederation of Strings:

The string is a stark data structure and
everywhere it is passed there is much duplication of process.
It is a perfect vehicle for hiding information.

--Alan Perlis

Tools built in the Unix tradition excel at manipulating strings as data.

Yet many newer Unix users are unaware of the classic tools and their power.

In this article, I'll provide a functional introduction to four important concepts and tools for sculpting text: regex, grep, sed and awk.

In short:

  • regex is a language for describing patterns in strings;
  • grep filters its input against a pattern;
  • sed applies transformation rules to each line; and
  • awk manipulates an ad hoc database stored as text, e.g. CSV files.

With this functional introduction, my goal is to cover enough of each tool to handle 80%-90% of its niche use cases.

Read on for a touch of history, theory and practice.


This post is part of a "Unix fundamentals" series; see basic Unix, and settling into Unix for more.

Theory: Regular languages

Many tools for searching and sculpting text rely on a pattern language known as regular expressions.

The theory of regular languages underpins regular expressions.

(Caveat: Some modern "regular" expression systems can describe irregular languages, which is why the term "regex" is preferred for these systems.)

Regular languages are a class of formal language equivalent in power to those recognized by deterministic finite automata (DFAs) and nondeterministic finite automata (NFAs).

[See my post on converting regular expressions to NFAs.]

In formal language theory, a language is a set of strings.

For example, {"foo"} and {"foo", "foobar"} are formal (if small) languages.

(Mathematicians don't typically put quotes around a string, preferring to let the fixed-width typewriter font distinguish it as one, but I'm guessing that programmers are more comfortable with the quotes around strings.)

In regular language theory, there are two atomic languages:

  • $\epsilon$ -- the null language, which contains the string of length zero; and
  • $\emptyset$ -- the empty language, which contains no strings at all.

In almost every programming language, the null string is written "".

Mathematicians are often sloppy with the notation for the null language, using $\epsilon$ to represent both the null language, {""}, and the null string, "".

For each character c in the alphabet, there is a corresponding one-character primitive language, {"c"}.

(The alphabet is a set of characters, usually denoted $\Sigma$ or $A$.)

Once again, mathematicians are often sloppy in their notation, using the character c to mean the language {"c"}.

Regular languages are those that can be obtained by unrestricted composition of the operations union, concatenation and Kleene star on the atomic and primitive languages:

  • The union of languages $L_1$ and $L_2$, written $L_1 \cup L_2$, is set union: \[ L_1 \cup L_2 = \{ x \mathrel{|} x \in L_1 \text{ or } x \in L_2 \} \text. \]
  • The concatenation of two languages $L_1$ and $L_2$, written $L_1 \circ L_2$, is akin to Cartesian product: \[ L_1 \circ L_2 = \{ \mathtt{"}xy\mathtt{"} \mathrel{|} \mathtt{"}x\mathtt{"} \in L_1 \text{ and } \mathtt{"}y\mathtt{"} \in L_2 \} \text. \] Concatenation is often written as juxtaposition: $L_1 L_2 = L_1 \circ L_2$.
  • The language $L$ to the $n$th power, written $L^n$, is the language containing every concatenation of $n$ strings from $L$: \[ L^n = \{ \mathtt{"}x_1\cdots x_n\mathtt{"} \mathrel{|} \mathtt{"}x_i\mathtt{"} \in L \text{ for all } i \text{ between } 1 \text{ and } n \} \text. \] Of course, $L^0 = \epsilon$.
  • The Kleene star (the "possibly empty repetition") of a language $L$, written $L^\star$, contains every string formed by concatenating zero or more strings from $L$: \[ L^\star = \bigcup_{i=0}^\infty\; L^i \text. \]

For example, the language $((\mathtt{a} \circ \mathtt{b}) \cup \mathtt{c})^\star$ contains strings like "", "ab", "c", "abab", "ababc" and "cab".
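As a rough sanity check of that example, the same language can be written in the extended grep notation introduced later in this article. This is only a sketch: the function name in_language is made up for illustration, while -E (extended syntax), -x (match the whole line) and -q (exit status only) are standard grep flags.

```shell
# in_language STRING: succeed if STRING is in ((a b) | c)*.
# -E selects extended syntax, -x anchors the match to the whole line,
# -q suppresses output so that only the exit status is used.
in_language() {
  printf '%s\n' "$1" | grep -Eqx '(ab|c)*'
}

for s in "" ab c abab ababc cab ba abca; do
  if in_language "$s"; then
    printf '"%s" is in the language\n' "$s"
  else
    printf '"%s" is not in the language\n' "$s"
  fi
done
```

The first six strings are the ones listed above; "ba" and "abca" are added as counterexamples.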

There are also a few common non-primitive regular operations:

  • The non-empty repetition of a language $L$, written $L^+$, is the same as Kleene star, but at least one copy of $L$ must be matched: \[ L^+ = L \circ L^\star \text. \]
  • The option of a language $L$, written $L^?$, is either $L$ or the null string: \[ L^? = L \cup \epsilon \]
  • The bounded repetition of a language $L$, written $L^{[n,m]}$, consists of between $n$ and $m$ concatenated strings from $L$: \[ L^{[n,m]} = \bigcup_{i=n}^m\; L^i \text. \]

The theory of regular languages provides algorithms and techniques to answer questions like:

  • Given a string $s$ and a language $L$, is $s$ in $L$?
  • Given a string $s$ and a language $L$, which substrings of $s$ are in $L$?
  • Given a language $L$, is it regular?

Regular expressions in code

In code, regular expressions describe matchable patterns over text.

They are often used to describe locations in text (e.g. all lines that match this pattern) and to transform text (e.g. transform text matching a pattern into different text).

There is no single universal standard for regular expressions in code, but most languages employ a dialect from a common ancestor.

The three major dialects every programmer should know are:

  • basic regular expressions (BRE);
  • extended regular expressions (ERE); and
  • Perl-compatible regular expressions (PCRE).

Since this article is an introduction, it covers BRE and ERE. (PCRE is largely an extension of ERE.)

The notation used in all regular expression implementations is inspired by the mathematical formalism.

The following table describes a generic regular expression pattern language:

Math                           Pattern              Pattern meaning
$\emptyset$                    no equivalent        matches nothing
$\epsilon$                     no character at all  matches ""
c                              c                    matches "c"
$L_1 \circ L_2$                p1p2                 matches p1 then p2
$L_1 \cup L_2$                 p1|p2                matches p1 or p2
$L^\star$                      p*                   matches "" or p repeated
$L^+$                          p+                   matches p repeated, but not ""
$L^?$                          p?                   matches p or ""
$L^n$                          p{n}                 matches p repeated n times
$L^{[n,m]}$                    p{n,m}               matches p repeated n to m times
$\Sigma$                       .                    matches any character
$\{c_1,\ldots,c_n\}$           [c1...cn]            matches $c_1$ or $c_2$ or ... or $c_n$
$\Sigma - \{c_1,\ldots,c_n\}$  [^c1...cn]           matches any char but $c_1$ or ... or $c_n$
$(L)$                          (p)                  matches p, remembers submatch
no equivalent                  \n                   matches string from nth submatch
no equivalent                  \b                   matches a word boundary
no equivalent                  \w                   matches a word character, e.g., alphanumeric
no equivalent                  \W                   matches a nonword character, e.g., punctuation
no equivalent                  \s                   matches a whitespace character, e.g., space, tab, return
no equivalent                  \S                   matches a non-whitespace character, e.g., alphanumeric, punctuation
no equivalent                  \d                   matches a digit character, i.e., 0-9
no equivalent                  \D                   matches a non-digit character, e.g., letter, punctuation
no equivalent                  ^                    matches start of line/string
no equivalent                  $                    matches end of line/string
no equivalent                  [c1-c2]              matches $c_1$ through $c_2$

Backreferences are numbered by left parentheses: the $n$th left parenthesis denotes the $n$th submatch.
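To make the numbering rule concrete, here is a small sketch using sed -E (covered later in this article) with nested groups. The variable name numbered is only for illustration.

```shell
# Three groups, numbered by the position of their left parentheses:
#   \1 = ((a)(b)), the outer group
#   \2 = (a)
#   \3 = (b)
numbered=$(printf 'ab\n' | sed -E 's/((a)(b))/1:\1 2:\2 3:\3/')
printf '%s\n' "$numbered"   # 1:ab 2:a 3:b
```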

The sections ahead discussing individual tools will note individual differences for dialects like BRE and ERE.

grep: POSIX basic regular expressions

The tool grep can filter a file, line by line, against a pattern.

The command grep pattern file prints each line of file which contains a match for pattern. Given no file, it reads from the standard input.

The equally useful command grep -v pattern file prints each line of the file file which does not contain a match for pattern.
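A minimal sketch of both forms, reading from standard input rather than a file. The sample text and the variable name sample are invented for the example.

```shell
sample='apple pie
banana split
cherry tart
apple crumble'

# Lines that contain a match for "apple".
printf '%s\n' "$sample" | grep apple

# Lines that do NOT contain a match for "apple".
printf '%s\n' "$sample" | grep -v apple
```

The first command prints the two apple lines; the second prints the other two.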

By default, grep uses basic regular expressions (BRE).

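BRE differs syntactically from the generic notation above in several key ways. Specifically, the operators {}, (), +, | and ? must be escaped with \ (strict POSIX BRE leaves \+, \? and \| undefined, but GNU grep accepts them as extensions), and many of the character class shortcuts have names instead: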

Math                           BRE                  Pattern meaning
$\emptyset$                    no equivalent        matches nothing
$\epsilon$                     no character at all  matches ""
c                              c                    matches "c"
$L_1 \circ L_2$                p1p2                 matches p1 then p2
$L_1 \cup L_2$                 p1\|p2               matches p1 or p2
$L^\star$                      p*                   matches "" or p repeated
$L^+$                          p\+                  matches p repeated, but not ""
$L^?$                          p\?                  matches p or ""
$L^n$                          p\{n\}               matches p repeated n times
$L^{[n,m]}$                    p\{n,m\}             matches p repeated n to m times
$\Sigma$                       .                    matches any character
$\{c_1,\ldots,c_n\}$           [c1...cn]            matches $c_1$ or $c_2$ or ... or $c_n$
$\Sigma - \{c_1,\ldots,c_n\}$  [^c1...cn]           matches any char but $c_1$ or ... or $c_n$
$(L)$                          \(p\)                matches p, remembers submatch
no equivalent                  \n                   matches string from nth submatch
no equivalent                  \b                   matches a word boundary
no equivalent                  [[:alnum:]_]         matches a word character, i.e., alphanumeric or underscore
no equivalent                  [[:space:]]          matches a whitespace character, e.g., space, tab, return
no equivalent                  [[:digit:]]          matches a digit character, i.e., 0-9
no equivalent                  [[:xdigit:]]         matches a hex digit character, i.e., A-F, a-f, 0-9
no equivalent                  [[:upper:]]          matches an uppercase character
no equivalent                  [[:lower:]]          matches a lowercase character
no equivalent                  ^                    matches start of line/string
no equivalent                  $                    matches end of line/string
no equivalent                  [c1-c2]              matches $c_1$ through $c_2$

A common use case for grep is command | grep word, which will dump out the lines from the output of command containing the word.

For instance, ps u | grep matt will dump out processes run by the user matt (and possibly a few others that happen to have matt on the line).

A fun way to learn how to use grep is to run it against the dictionary file, /usr/share/dict/words.

Suppose you're doing a crossword, and you know a word is seven letters long, with a for its second letter and x for its sixth. Get a hint:

 $ grep '^.a...x.$' /usr/share/dict/words
 cachexy
 carboxy
 martext
 panmixy

We can use backreferences to submatches to print out words that repeat themselves:

 $ grep '^\(.*\)\1$' /usr/share/dict/words
 aa
 adad
 akeake
 anan
 arar
 atlatl
 baba
 barabara
 benben
 beriberi
 bibi
 ...

The \1 refers back to the string matched by the first parenthesized submatch. In this case, that's \(.*\).

Recall that the $n$th left parenthesis denotes the $n$th submatch.

(Technically, backreferences break the regularity of grep.)

We could find strings that consist of two repeated strings back to back:

 $ grep '^\(.\+\)\1\(.\+\)\2$' /usr/share/dict/words
 susurr

Apparently, there's only one match in my dictionary!

Using the start-of-line and end-of-line markers was necessary here. Without them, we get words that contain a substring that repeats itself:

 $ grep '\(.\+\)\1' /usr/share/dict/words
 aa
 aal
 aalii
 aam
 aardvark
 aardwolf
 abactinally
 abaff
 abaissed
 abandonee 

In this case, changing the * to \+ also became necessary, since .* matches even the null string, which every string trivially contains.

If you need to find a specific IP address, say 1.10.3.20, in a log file, you can do that by escaping the dots:

 $ grep '\b1\.10\.3\.20\b' log

The word-boundary pattern \b is necessary to prevent lines containing text like 101.10.3.20 from matching.
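The effect of the escapes and the word boundaries can be seen on a tiny made-up log. The log lines and the variable name log are invented for the example, and \b is assumed to be available (it is in GNU and BSD grep, but it is not part of strict POSIX).

```shell
log='connect from 1.10.3.20 port 22
connect from 101.10.3.20 port 22
connect from 1.10.3.200 port 22
connect from 1x10y3z20 port 22'

# Unescaped dots match any character and nothing anchors the ends,
# so every line matches; -c prints the count of matching lines.
printf '%s\n' "$log" | grep -c '1.10.3.20'

# Escaped dots plus word boundaries: only the first line matches.
printf '%s\n' "$log" | grep '\b1\.10\.3\.20\b'
```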

Useful grep flags

  • -v inverts the match.
  • --color colors the matched text.
  • -F interprets the pattern as a literal string.
  • -H, -h print (or don't print) the name of the matching file.
  • -i matches case insensitively.
  • -l prints names of files that match instead.
  • -n prints the line number.
  • -w forces the pattern to match an entire word.
  • -x forces patterns to match the whole line.

egrep: POSIX extended regular expressions

The tool egrep is identical to grep, except that it uses POSIX extended regular expressions. (The same behavior is available as grep -E, which is the form POSIX specifies.)

POSIX extended regular expressions are identical to basic regular expressions, but the operators {}, (), +, | and ? should not be escaped.

This change substantially unclutters complex expressions, such as the double word example:

 $ egrep '^(.*)\1$' /usr/share/dict/words
 aa
 adad
 akeake
 anan
 arar
 atlatl
 baba
 barabara
 ...
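As a small check that the two syntaxes describe the same pattern, the sketch below runs the basic and extended forms of a repeated-word search on a made-up word list and compares the results. It uses grep -E in place of egrep, and it relies on \+ in the basic form, which GNU grep accepts as an extension.

```shell
words='aa
adad
abc
beriberi'

bre=$(printf '%s\n' "$words" | grep    '^\(.\+\)\1$')   # basic syntax
ere=$(printf '%s\n' "$words" | grep -E '^(.+)\1$')      # extended syntax

# Both variables should hold the same three words: aa, adad, beriberi.
[ "$bre" = "$ere" ] && printf 'same matches:\n%s\n' "$ere"
```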

Consider a search for all words that have an oo at least one letter before an ee, or an ee at least one letter before an oo:

 $ egrep 'oo.+ee|ee.+oo' /usr/share/dict/words 
 beechwood
 beechwoods
 beefwood
 beetroot
 beetrooty
 bloodweed
 bookkeeper
 bookkeeping
 bootee
 brookweed 
 ...

Consider a search for words that contain between 5 and 7 vowels:

 $ egrep '^([^aieou]*[aieou]){5,7}[^aieou]*$' /usr/share/dict/words
 abacinate
 abacination
 abaisance
 abalienate
 abalienation
 abandonable
 abandonee
 abarticular
 abarticulation
 abastardize
 ...

Warning: Due to strangeness with grep's handling of Unicode, the previous example only worked with the environment variable LANG=C set.

The power of backreferences: Prime-finding

Backreferences, as noted, break the regularity of the pattern language.

There's a famous regex which uses backreferences to match composite (non-prime) numbers in unary form:

 ^(11+)(\1)+$

Thus, egrep -v '^(11+)\1+$' will print out only lines of prime length (given lines of two or more 1s):

 $ egrep -v '^(11+)\1+$' <<EOF
 11
 111
 1111
 11111
 111111
 1111111
 11111111
 111111111
 1111111111
 11111111111
 EOF
 11
 111
 11111
 1111111
 11111111111 

Most variants of this regex use the Perl-extended (11+?) in place of (11+).

The +? means try the minimal match first, which directs the backtracking to be a little more intelligent in the order that it searches.

But, for correctness, minimal-match-first is not necessary.

If there exists a match at all, then the number is not prime.
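The same filter can be driven by a loop instead of a here-document. The sketch below generates the unary numbers from 2 up to a limit, keeps the lines the regex fails to match, and converts each surviving line back to decimal with awk's length function. The function name unary_primes is made up for the example, and the backreference in an extended pattern is assumed to be supported, as it is in GNU and BSD grep.

```shell
# unary_primes LIMIT: print the primes between 2 and LIMIT, one per line.
unary_primes() {
  limit=$1
  n=2
  line=11                          # the number 2 in unary
  while [ "$n" -le "$limit" ]; do
    printf '%s\n' "$line"
    line="${line}1"                # next unary number
    n=$((n + 1))
  done |
    grep -Ev '^(11+)\1+$' |        # drop composite lengths
    awk '{ print length($0) }'     # unary back to decimal
}

unary_primes 30
```

Because the regex backtracks over every candidate divisor, this is a curiosity rather than a practical primality test.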

For more discussion of this (and related) regexen and their limits, see Andrei Zmievski's write-up.

According to the lore, Abigail created this regex.

sed

sed is a "stream editor."

It reads a file line-by-line, conditionally applying a sequence of operations to each line and (possibly) printing the result.

By default, sed uses POSIX basic regular expression syntax. To use the (more comfortable) extended syntax, supply the flag -E.

Most sed programs consist of a single sed command: substitute.

For example, to replace instances of the regular expression [ch]at with ball, use:

 $ sed 's/[ch]at/ball/g' < in > out

A proper sed program is a sequence of sed commands.

Most sed commands have one of three forms:

  1. operation -- apply this operation to the current line.
  2. address operation -- apply this operation to the current line if at the specified address.
  3. address1,address2 operation -- apply this operation to the current line if between the specified addresses.
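The three forms can be compared side by side on a four-line input. The sample text and the variable name input are invented for the example.

```shell
input='cat
cat
cat
cat'

printf '%s\n' "$input" | sed 's/cat/dog/'      # 1. no address: every line
printf '%s\n' "$input" | sed '2s/cat/dog/'     # 2. one address: line 2 only
printf '%s\n' "$input" | sed '2,3s/cat/dog/'   # 3. two addresses: lines 2 through 3
```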

Numeric addresses

The simplest address is a line number.

For example, to print the first 12 lines, use sed '12q'. The command q quits sed, so this program quits right after it prints the 12th line.

To print only the fourth line, use sed -n '4p'. The flag -n suppresses the default printing behavior, while the command p prints the line.

For convenience, the address $ refers to the last line.
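A short sketch of numeric addresses, including $, on a five-line input. The sample text and the variable name seq_lines are invented for the example.

```shell
seq_lines='one
two
three
four
five'

printf '%s\n' "$seq_lines" | sed '2q'        # first two lines: one, two
printf '%s\n' "$seq_lines" | sed -n '4p'     # only line 4: four
printf '%s\n' "$seq_lines" | sed -n '$p'     # the last line: five
printf '%s\n' "$seq_lines" | sed -n '2,$p'   # line 2 through the end
```

The single quotes matter: they stop the shell from treating $p as a variable.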

Pattern addresses

Addresses can be regular expressions in the form of /pattern/.

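For example, to extract the text between <body> and </body> in a file use the following sed program. (The flags are combined into -Enf because many systems, Linux included, pass everything after the interpreter path on the #! line as a single argument.)

#!/usr/bin/sed -Enf
/<body>/,/<\/body>/ p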

But, this also prints out the body tags.

A group command { ... } helps here:

#!/usr/bin/sed -Enf
/<body>/,/<\/body>/ {
  /<body>/b
  /<\/body>/b
  p
}

In this case, the b command skips to the next line.

But, this will miss text on the same line as the opening and closing tags.

Using substitute commands to strip out the tags fixes this problem:

#!/usr/bin/sed -Enf
/<body>/,/<\/body>/ {
  s/^.*<body>//
  s/<\/body>.*$//
  p
}

But, this breaks in the (rare) case of a body tag being on one line, as in:

  <body> hello world </body>

The problem is that ranges cannot start and end on the same line.

To get around this, add a special case to catch it:

 #!/usr/bin/sed -Enf
 /<body>.*<\/body>/ {
   s/<body>(.*)<\/body>/\1/
   p
   q
 }
 
 /<body>/,/<\/body>/ {
   s/^.*<body>//
   s/<\/body>.*$//
   p
 }

But, this script still breaks if there are nested body tags in the document.

If nesting in a pattern matters, it's probably time to switch to a formalism more powerful than regular languages, such as context-free languages.
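To try the final (non-nested) version without saving a script file, the same commands can be passed to sed on the command line. This is a sketch: the wrapper function extract_body and the sample page are invented for the example.

```shell
page='<html>
<body>first
second
third</body>
</html>'

# extract_body: read HTML on stdin, print the text inside <body>...</body>.
extract_body() {
  sed -E -n '
    /<body>.*<\/body>/ {
      s/<body>(.*)<\/body>/\1/
      p
      q
    }
    /<body>/,/<\/body>/ {
      s/^.*<body>//
      s/<\/body>.*$//
      p
    }
  '
}

printf '%s\n' "$page" | extract_body                               # first, second, third
printf '%s\n' '<body>hello world</body>' | extract_body            # hello world
```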

Useful operations

  • The group operation { operation1 ; ... ; operationn } executes all of the specified operations, in order, on the given address.
  • The operation s/pattern/replacement/arguments replaces instances of pattern with replacement according to the arguments in the current line. In the replacement, \n stands for the nth submatch, while & represents the entire match.
  • The operation b branches to a label, and if none is specified, then sed skips to processing the next line. Think of this as a break operation.
  • The operation y/from/to/ transliterates the characters in from to their corresponding character in to.
  • The operation q quits sed.
  • The operation d deletes the current line.
  • The operation w file writes the current line to the specified file.
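A few of these operations in action, as a minimal sketch on made-up one-line inputs:

```shell
# & stands for the whole match: wrap every number in brackets.
printf 'order 66 of 100\n' | sed -E 's/[0-9]+/[&]/g'   # order [66] of [100]

# y transliterates character by character: a, b, c become A, B, C.
printf 'a cab\n' | sed 'y/abc/ABC/'                    # A CAB

# d deletes every line at the address: here, blank lines.
printf 'one\n\ntwo\n' | sed '/^$/d'                    # one, two
```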

Common arguments to the substitute operation

The most common argument to the substitute command is g, which means "globally" replace all matches on the current line, instead of just the first.

Sometimes, other arguments are useful:

  • n tells sed to replace the nth match only, instead of the first.
  • p prints out the result if there is a substitution.
  • i ignores case during the match (a GNU extension, also written I).
  • w file writes the result to file if there is a substitution.
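The difference between no argument, g, a number and p shows up clearly on a line with several matches. The sample inputs and the variable name line are invented for the example.

```shell
line='a-a-a-a'

printf '%s\n' "$line" | sed 's/a/X/'     # first match only:  X-a-a-a
printf '%s\n' "$line" | sed 's/a/X/g'    # every match:       X-X-X-X
printf '%s\n' "$line" | sed 's/a/X/3'    # third match only:  a-a-X-a

# With -n, the p argument prints only lines where a substitution happened.
printf 'keep a\nskip b\n' | sed -n 's/a/X/p'   # keep X
```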

Useful flags

  • -n suppresses automatic printing of each result; to print a result, use command p.
  • -f sedfile uses sedfile as the sed program.

Examples

Strip comment lines starting with #:

 $ sed '/^#/d' 

Delete C++-style // comments

 $ sed 's/\/\/.*$//'

Encrypt with the Caesar cipher:

 $ sed 'y/abcdefghijklmnopqrstuvwxyz/defghijklmnopqrstuvwxyzabc/' 

Decrypt with the Caesar cipher:

 $ sed 'y/defghijklmnopqrstuvwxyzabc/abcdefghijklmnopqrstuvwxyz/' 

Change names from "Last, First [Middle/Middle Initial.]" to "First [Middle/Middle Initial.] Last":

 $ sed -E 's/([A-Z][a-z]*), ([A-Z][a-z]*( [A-Z][a-z]*[.]?)?)/\2 \1/g'
 Might, Matthew B.
 Matthew B. Might

Next steps with sed

sed is much more powerful than this summary suggests.

There are labels (:) and branching commands (b, t) that allow loops, and in theory, arbitrary (Turing-equivalent) computation.

sed keeps track of both a pattern space (the current line) and hold space, and there are commands to manipulate both of them, e.g., g, G, h and H.

That said, you should probably never use these commands!

If you find yourself tempted to use these more advanced constructs, it's a sign that you want to use a tool like awk or Perl instead.

AWK

The awk command provides a more traditional programming language for text processing than sed.

Those accustomed to seeing only hairy awk one-liners might not even realize that AWK is a real programming language. For example, here's a comprehensible AWK program that prints the factorial of each line:

#!/usr/bin/awk -f

{ print factorial($0); }

function factorial(n) {
 if (n <= 1)   # base case; also stops on blank or negative input
   return 1;
 else 
   return n*factorial(n-1);
}

Of course, AWK can be terse and obtuse too. Here's a popular one-liner that prints out the unique lines of a file:

awk '!a[$0]++' file
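To unpack that one-liner, here it is next to a longer program with the same behavior. This is a sketch on made-up input; the names dupes and seen are invented for the example.

```shell
dupes='b
a
b
c
a'

# Terse form: a[$0]++ evaluates to 0 (false) the first time a line is seen,
# so !a[$0]++ is true exactly once per distinct line, and a bare pattern prints.
printf '%s\n' "$dupes" | awk '!a[$0]++'

# Expanded form with the same output: b, a, c.
printf '%s\n' "$dupes" | awk '{
  if (!($0 in seen))   # first occurrence of this line?
    print
  seen[$0] = 1         # remember it for later lines
}'
```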

The major difference in philosophy between AWK and sed is that AWK is record-oriented rather than line-oriented.

Each line of the input to AWK is treated like a delimited record.

The AWK philosophy melds well with the Unix tradition of storing data in ad hoc line-oriented databases, e.g., /etc/passwd.

That is, where sed sees a file like this:

 line1
 line2
 line3
 ...

awk sees a file like this:

 record1
 record2
 record3
 ...

where each record is:

 field1 field2 field3 ...

The command line parameter -F regex sets the regular expression regex to be the field delimiter.

For instance, awk -F "," sees each record as:

 field1,field2,field3,...

To print out the account name and uid from /etc/passwd, use:

 $ awk -F : '/^[^#]/ { print $1, $3 }' /etc/passwd
 nobody -2
 root 0
 daemon 1
 ...

AWK programs

An AWK program consists of pattern-action pairs:

pattern { statements }

followed by an (optional) sequence of function definitions.

In fact, an action is optional, and a pattern by itself is equivalent to:

pattern { print }

As each record is read, each pattern is checked in order, and if it matches, then the corresponding action is executed.

Function definition

The form for function definition is:

function name(arg1,...,argn) { statements }

As in C, a return statement returns the result of the function.

Patterns

The most common one-line pattern in AWK is the blank pattern, which matches every line.

The other pattern forms include:

  • /regex/, which matches if the regex matches something on the line;
  • expression, which matches if expression is nonzero or non-null;
  • p1, p2, which matches all records (inclusive) between p1 and p2;
  • BEGIN, which matches before the first line is read; and
  • END, which matches after the last line is read.
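A short sketch that combines three of these pattern forms in one program. The input and the variable name letters are invented for the example.

```shell
letters='a
b
c
d
e'

printf '%s\n' "$letters" | awk '
  BEGIN   { print "start" }          # runs before any input is read
  /b/,/d/ { print "in range:", $0 }  # records from the first b through d
  END     { print NR, "records" }    # runs after the last record
'
```

This prints start, then the three records b, c and d prefixed with "in range:", and finally "5 records".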

Some implementations of awk, like gawk, provide additional patterns:

  • BEGINFILE, which matches before a new file is read; and
  • ENDFILE, which matches after a file is read.

Expressions

AWK expressions appear in both patterns and in statements.

A basic AWK expression is either:

  • a special variable, e.g., $1 or NF;
  • a regular variable, e.g., foo;
  • a string literal, e.g., "foobar";
  • a numeric constant, e.g., 3, 3.1; or
  • a regex constant, e.g., /foo|bar/.

A regex constant can be passed as an argument to built-in functions that expect a regex, such as match, sub and gsub.

AWK supports a match expression form, exp1 ~ exp2, where the assumption is that exp1 will evaluate to a string, exp2 will evaluate to a regex, and the result of matching is returned.

A lone regex constant in a conditional is implicitly equivalent to a match against the current record; that is, /regex/ becomes $0 ~ /regex/.

For example, to filter lines that contain both foo and bar:

 $ awk '/foo/ && /bar/ { print }'

or just:

 $ awk '/foo/ && /bar/'
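The match operator is most useful when applied to a single field rather than the whole record. The sketch below uses made-up data and also shows !~, the negated form of the operator.

```shell
table='alice 30
bob n/a
carol 41'

# Print the name only when the second field consists entirely of digits.
printf '%s\n' "$table" | awk '$2 ~ /^[0-9]+$/ { print $1 }'    # alice, carol

# !~ succeeds when the regex does not match.
printf '%s\n' "$table" | awk '$2 !~ /^[0-9]+$/ { print $1 }'   # bob
```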

AWK brings the expected C-like arithmetic (like +), comparison (like ==) and Boolean operators (like &&).

As in C, variable assignment is an expression rather than a statement.

For example, to print account names from /etc/passwd where the account number is 500, use:

 $ awk -F : '$3 == 500 { print $1 }' /etc/passwd

String concatenation is simply juxtaposition. As a result, it may be necessary to surround strings to be concatenated with parentheses, e.g., ("bar = " bar ".").

Abutting a name with parentheses indicates a function call; for example, the following program surrounds every line of input with curly braces:

#!/usr/bin/awk -f

{ print f($0) }

function f(line) {
    return ("{" line "}") ;
}

Arrays

AWK supports both scalars and arrays.

Arrays in AWK are associative, much like objects in JavaScript.

To reference an index in an array, use the C-style subscript notation, variable-name[index], where index can be any expression that evaluates to a scalar value.

There is no need to create an array explicitly: just assign into an index in an undefined variable name.

To check for the existence of an index, use the in operator: index in variable-name.

For example, to print the account name with the highest uid run the following on /etc/passwd:

#!/usr/bin/awk -f

BEGIN { FS = ":" ; }   # same effect as the -F : flag

/^#/ { next ; }

{ users[$3] = $1 ; }

END {
 max = 0 ;
 for (i in users) {
  if ((i+0) > (max+0)) 
   max = i ;
 }
 print users[max];
}

The (i+0) and (max+0) are necessary to forcibly convert them to numerics. Otherwise, > compares them lexically as strings.
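As another small sketch of associative arrays, the program below counts words, using implicit array creation, the in operator and the for-in loop. The input text and the names text, count and word are invented for the example. The output is piped through sort because awk does not define the order in which for-in visits the keys.

```shell
text='red blue red
green blue red'

printf '%s\n' "$text" | awk '
  { for (i = 1; i <= NF; i++) count[$i]++ }   # no declaration needed
  END {
    if ("purple" in count) print "purple was seen"   # false: prints nothing
    for (word in count) print word, count[word]
  }
' | sort
```

The sorted output is blue 2, green 1 and red 3, one pair per line.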

Arrays have a split first-/second-class status in AWK.

Arrays are passed as parameters to procedures by reference.

But, it is not possible to assign an array to a variable.

#!/usr/bin/awk -f

BEGIN {
  arr[0] = 1 ;
  print 0, arr[0] ;   # prints 0 1
  modify_array(arr) ; # ok
  print 0, arr[0] ;   # prints 0 2
  brr = arr ;         # error
  exit ;
}

function modify_array(array) {
  for (k in array) {
    array[k]++ ;
  }
}

Arrays may not be returned from procedures either.

Special variables

There are several special variables in AWK:

Variable  Meaning
$0        text of the matched record
$n        the nth entry in the current record
FILENAME  name of current file
NR        number of records seen thus far
FNR       number of records thus far in this file
NF        number of fields in current record
FS        input field delimiter, defaults to whitespace
RS        record delimiter, defaults to newline
OFS       output field delimiter, defaults to space
ORS       output record delimiter, defaults to newline

These special variables can be used in patterns.

For instance, one could print the even lines:

 $ awk 'NR % 2 == 0 { print }'

Special variables like OFS can also be assigned as the program executes.

Technically, $n is not a variable.

In fact, $ is a special pseudoarray applied to the expression on its right.

For example, $(0) is an expression, as are $i and $(a[i]).

And, by extension, $NF is the last field.
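A brief sketch of NR, NF and $ applied to an expression, on made-up input:

```shell
printf 'a b c\nd e\n' | awk '{ print NR ": " NF " fields, last = " $NF }'
# 1: 3 fields, last = c
# 2: 2 fields, last = e

# $ applies to any expression, for example the second-to-last field.
printf 'a b c\n' | awk '{ print $(NF - 1) }'    # b
```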

Statements

AWK is a small language, with only a handful of forms for statements.

The man page lists all of them:

  if (expression) statement [ else statement ]
  while (expression) statement
  for (expression; expression; expression) statement
  for (var in array) statement
  do statement while (expression)
  break
  continue
  { [ statement ... ] }
  expression  
  print [ expression-list ] [ > expression ]
  printf format [ , expression-list ] [ > expression ]
  return [ expression ]
  next             
  nextfile
  delete array[expression]
  delete array            
  exit [ expression ] 

The most common statement is print; with no arguments, it is equivalent to print $0.

If arguments to print are comma-separated, then they are spliced together with OFS.

For example:

 $ echo foo bar | awk '{ OFS="::" ; print $1, $2 ; exit }'
 foo::bar

Most of these statements should be familiar to programmers, and some look eerily similar to those found in JavaScript.

The delete statement deletes an index from an array, or alternately, the entire array.

Control statements

AWK supports C-style control constructs like if, for and while.

It also supports a special for form for iterating over the keys in an associative array:

for (var in array-name) statement

The control statements next and nextfile skip to the next line of input and the next file respectively.

Built-in functions

AWK comes with a large set of built-in functions.

These are also listed in the AWK man page.

Perhaps the most useful is gensub(regex, replacement, params [ , input ]), which returns roughly the result of sed's s/regex/replacement/params run on input or $0 by default.

For example, to change C++-style // comments to C-style comments:

 $ awk '{ print gensub(/\/\/[ ]?(.*)/, "/* \\1 */", "g" ) }'

Not all AWK implementations support gensub (it is a gawk extension), so you might have to use the more limited sub and gsub instead.
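Without gensub, the comment conversion can be approximated with index, substr and sub, which are available in POSIX awk. This is one possible sketch, not a drop-in replacement: it rewrites only the first // on each line, and the variable names i and text are invented for the example.

```shell
printf 'x = 1; // set x\ny = 2;\n' | awk '{
  i = index($0, "//")               # position of the first //, or 0 if absent
  if (i > 0) {
    text = substr($0, i + 2)        # everything after the //
    sub(/^ /, "", text)             # drop one optional leading space
    $0 = substr($0, 1, i - 1) "/* " text " */"
  }
  print
}'
```

The first input line becomes x = 1; /* set x */ and the second passes through unchanged.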

Useful flags

  • -f filename uses the provided file as the AWK program.
  • -F regex sets the input field separator.
  • -v var=value sets a global variable. Multiple -v flags are allowed.
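For instance, -v makes it easy to pass a threshold into a program without editing it. The data and the names scores and min are invented for the example.

```shell
scores='alice 72
bob 45
carol 90'

# -v sets the awk variable min before the program starts running.
printf '%s\n' "$scores" | awk -v min=60 '$2 >= min { print $1 }'
# alice
# carol
```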

vim and emacs

Text editors in the Unix tradition excel at manipulating text.

If you haven't yet taken the (brief) tutorials for both editors, do so at your earliest convenience.

You can apply the knowledge from this article inside vim and emacs, which have their own rich regex-based search-and-replace systems:

Command  vim          emacs
search   /pattern     C-M-s pattern RET
replace  :s/pat/new/  M-x replace-regexp RET pat RET new RET

Both editors default to a BRE-like syntax.

In both, the escape \n expands into the nth submatch.

In emacs, the escape \& expands into the matched text, while just the character & expands into the matched text in vim.

You can also direct both editors to interact with sed and AWK, or any other shell command for that matter:

Command                    vim              emacs
insert output of command   :r!command       M-1 M-! command
pipe selection to command  :'<,'>!command   M-1 M-| command RET


