Lex - A Lexical Analyzer Generator
The Principles of Lex

I. Introduction to Lex

Lex is a popular lexical analyzer generator, developed by Eric Schmidt and Mike Lesk at AT&T Bell Laboratories in 1975. A Lex program processes its input by reading the input stream and breaking it into a sequence of lexemes (tokens). Lex was designed to automate the construction of lexical analyzers and to provide a fast, efficient way to generate them.
II. Principles of Lexical Analysis

During analysis, the lexical analyzer breaks the input stream into lexemes according to predefined rule patterns. Each lexeme has a corresponding token that identifies the language unit it represents. The rules are usually described in a pattern language based on regular expressions.
1. Pattern matching

Pattern matching is the core of a lexical analyzer. In Lex, patterns are written as regular expressions. The analyzer tries the rules in the order they are defined and selects the pattern that yields the longest match on the input stream; if several patterns produce a longest match of the same length, the rule defined first wins. For example, suppose we have the following two rules:

    [0-9]+    { printf("NUM\n"); }
    [+-]      { printf("OP\n"); }

When the input text is "123", the pattern [0-9]+ matches the whole input, so the analyzer treats it as a NUM token and executes the corresponding action. If the input text is "+", the pattern [+-] matches it, and the analyzer treats it as an OP token and executes that action.
2. Action execution

When the analyzer finds a pattern that matches the input stream, it executes the associated action. Actions are usually fragments of C (or other) code that further process the matched lexeme. In the example above, the action prints the recognized token name to the console.
3. Generating the lexical analyzer

Using Lex to generate a lexical analyzer involves two steps: writing the rules file and generating the analyzer from it. When writing the rules file, we define a series of lexical rules, each expressed as a pattern-action pair; a complete example follows.
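To make the workflow concrete, here is a minimal sketch of a complete rules file built from the NUM/OP example above (the main and yywrap scaffolding is our addition, following common lex practice):

    %%
    [0-9]+     { printf("NUM\n"); }
    [+-]       { printf("OP\n"); }
    [ \t\n]    ;                       /* skip whitespace */
    %%
    int main() { yylex(); return 0; }
    int yywrap() { return 1; }

Running this file through lex (or flex) yields lex.yy.c, which compiles into a scanner that prints NUM for each number and OP for each sign it reads.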
How to Use LEX, an Automatic Compiler-Construction Tool

Lex automatically converts regular expressions describing the lexical structure of the input, together with their associated actions, into a program in a host language: the lexical analyzer. The analyzer has the fixed name yylex; here yylex is a C program. yylex recognizes the lexemes in the input string and performs the specified action whenever a lexeme is recognized.
Consider a simple example: a lex source program that converts the lowercase letters in its input to the corresponding uppercase letters:

    %%
    [a-z]    printf("%c", yytext[0] + 'A' - 'a');

The %% on the first line is a delimiter marking the start of the recognition rules. The second line is the recognition rule: on the left is a regular expression recognizing a lowercase letter, and on the right is the action taken when one is recognized, converting the lowercase letter to the corresponding uppercase letter.
Lex works by converting the regular expressions in the source program into a corresponding deterministic finite automaton and inserting the actions at the appropriate places in yylex; control flow is handled by an interpreter for that automaton, and the interpreter is the same for every source program.
1.2 Format of a lex source program

The general format of a lex source program is:

    {auxiliary definitions}
    %%
    {recognition rules}
    %%
    {user subroutines}

None of the bracketed sections is mandatory. When there is no user-subroutines section, the second %% may be omitted. The first %% is required, because it marks the start of the recognition rules. The shortest legal lex source program is:

    %%

Its effect is to copy the input string to the output file unchanged.
The recognition-rules section is the heart of a lex source program. It is a table whose left column holds regular expressions and whose right column holds the corresponding actions. A typical recognition rule is:

    integer    printf("found keyword INT");

This rule looks for the lexeme "integer" in the input; whenever it matches, the message "found keyword INT" is printed. Note that in a recognition rule the regular expression and the action must be separated by whitespace. If the action is a single simple C expression, it can be written on the same line as the regular expression; if it needs two or more lines, it must be enclosed in braces, or an error results. A complete version of the case-conversion example appears below.
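Putting these pieces together, the lowercase-to-uppercase converter can be written as a complete source file; a sketch (the main and yywrap scaffolding is our addition):

    %%
    [a-z]    printf("%c", yytext[0] + 'A' - 'a');
    %%
    int main() { yylex(); return 0; }
    int yywrap() { return 1; }

Characters that match no rule fall through to the default action and are copied to the output unchanged, so only the lowercase letters are transformed.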
A Tutorial on Flex and Compiler Principles

I. Introduction

Flex (Fast Lexical Analyzer Generator) is a fast lexical-analyzer generator that turns regular-expression rules into efficient C code implementing the lexical-analysis phase. This article introduces the basic concepts behind Flex and how it works.
II. What Is Lexical Analysis

Lexical analysis is the first phase of compilation; it divides the character sequence of the source program into a sequence of meaningful tokens. Tokens are the basic units of a language, such as keywords, identifiers, constants, and operators. The lexical analyzer's task is to convert the input character sequence into a token sequence according to predefined lexical rules.
III. Overview of How Flex Works

Flex is based on finite state automata. It represents the lexical rules as a set of regular expressions and converts them into an NFA (nondeterministic finite automaton) and then a DFA (deterministic finite automaton). Flex then turns these automata into C code, yielding the lexical analyzer.
IV. How Flex Works, in Detail

1. Defining lexical rules. In Flex, lexical rules are written as regular expressions. Each rule has two parts: a pattern and an action. The pattern matches input character sequences; the action specifies what to do after a successful match.
2. Building the NFA. From the lexical rules, Flex builds a set of NFA fragments, one per rule. An NFA fragment consists of a set of states and a transition function: the states represent the different situations the analysis can be in, and the transition function describes the moves between states.
3. Merging the NFAs. All of the NFA fragments are merged into one large NFA, linked together by ε-transitions: typically a new start state gets ε-moves to the start of each fragment, while each fragment keeps its own accepting state, tagged with the rule it belongs to.
4. Subset construction. The NFA is converted to a DFA by the subset construction. Its basic idea is to determine, from the current set of NFA states and an input character, the next set of states; iterating until no new sets appear yields the complete DFA.
5. DFA minimization. The generated DFA is then minimized: unreachable states are removed and equivalent states are merged, reducing the number of states. A small illustration of step 4 follows.
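As a concrete illustration of step 4, here is a minimal C sketch of the worklist form of the subset construction. It is not Flex's actual implementation: the toy NFA is hand-built and has no ε-moves, so the ε-closure step collapses to the identity. The NFA (states 0, 1, 2; state 2 accepting) recognizes strings over {a, b} ending in "ab".

    #include <stdio.h>

    #define NSTATES 3
    /* move_tab[s][c]: bitmask of NFA states reachable from s on c (0='a', 1='b') */
    static unsigned move_tab[NSTATES][2] = {
        /* state 0 */ { 0x1 | 0x2, 0x1 },   /* on 'a' -> {0,1}, on 'b' -> {0} */
        /* state 1 */ { 0x0,       0x4 },   /* on 'b' -> {2} */
        /* state 2 */ { 0x0,       0x0 },
    };

    /* Image of a state set under one input character. */
    static unsigned move_set(unsigned set, int c) {
        unsigned out = 0;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) out |= move_tab[s][c];
        return out;
    }

    int main(void) {
        unsigned dfa[16];                /* each DFA state is a set of NFA states */
        int ndfa = 0;
        dfa[ndfa++] = 0x1;               /* start state = {0} */
        for (int i = 0; i < ndfa; i++) { /* worklist: process states as discovered */
            for (int c = 0; c < 2; c++) {
                unsigned t = move_set(dfa[i], c);
                if (t == 0) continue;
                int j;
                for (j = 0; j < ndfa; j++) if (dfa[j] == t) break;
                if (j == ndfa) dfa[ndfa++] = t;   /* new DFA state discovered */
                printf("D%d --%c--> D%d%s\n", i, "ab"[c], j,
                       (t & 0x4) ? "  (accepting)" : "");
            }
        }
        printf("%d DFA states\n", ndfa);
        return 0;
    }

Running it prints the discovered transitions and reports 3 DFA states for this NFA.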
Worked Examples with the lex Tool

The translation-rules section (continued)

To avoid ambiguity, a space, newline, or tab needed inside a pattern Ri should be written '\s', '\n', or '\t'. Each code fragment Actioni may refer to previously defined constants, to global and external variables, and to functions defined in the auxiliary-functions section.

Note: when Actioni contains more than one C statement, the statements must be enclosed in braces { }; otherwise LEX reports an error.
Using the generated lexical analyzer

The user invokes the lexical analyzer by calling the function yylex(). Each call to yylex() recognizes one token in the input stream and returns an integer (the internal code of the recognized token). A token's internal code (also called its category code) is the integer value the analyzer returns when it recognizes a token of that category; its values are chosen by the user. For example, the user might define the internal code of identifiers (Id) as 300 and of real numbers (RealNo) as 450. In general, user-defined category codes should be greater than 256; values below 256 can serve as category codes for single-character tokens, represented by their ASCII values. For instance, the internal code of the token '+' can be defined as the ASCII value of '+', which is 43.
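A sketch of rules that return such category codes (the code values Id = 300 and RealNo = 450 follow the example above; the patterns and scaffolding are illustrative):

    %{
    #define Id     300
    #define RealNo 450
    %}
    %%
    [A-Za-z][A-Za-z0-9]*    { return Id; }
    [0-9]+"."[0-9]+         { return RealNo; }
    "+"                     { return '+'; }   /* single-character token: ASCII 43 */
    [ \t\n]                 ;                 /* skip whitespace */
    %%
    int main() {
        int code;
        while ((code = yylex()) != 0)         /* yylex() returns 0 at end of input */
            printf("token code %d\n", code);
        return 0;
    }
    int yywrap() { return 1; }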
LEX command format and options

    Format: lex [-ctvn -V -Q[y|n]] [files]

    Options:
      -c        actions are C statements (the default)
      -t        write the scanner to stdout instead of lex.yy.c
      -v        print a two-line statistical summary
      -n        suppress the -v summary
      -V        print LEX version information on stderr
      -Q[y|n]   y: print version information in the output file lex.yy.c;
                n: do not (the default)

If several files are given, they are treated as a single file; if files is empty, stdin is used.
1. Declarations

LEX declarations are divided into two parts…
Flex Notes

Introduction to LEX

LEX (Lexical Analyzer Generator), a tool for generating lexical analyzers, was first developed by M. E. Lesk and E. Schmidt of Bell Laboratories on the UNIX operating system in 1975. FLEX (Fast Lexical Analyzer Generator), distributed with GNU systems, is compatible with LEX. The examples below are all based on flex.
How LEX works: LEX scans the source file, performs macro substitution, converts the regular expressions in the rules section into an equivalent DFA, and produces the DFA's state-transition matrix (a sparse matrix). From this matrix and the C code in the source file it generates a lexical-analysis function named int yylex(), which it copies into the output file lex.yy.c. By default, yylex() takes standard input (stdin) as the input file for lexical analysis.
The input file (extension .l) is turned by LEX into the output file lex.yy.c. The format of a lex source file is:

    definitions
    %%
    rules
    %%
    additional user C code

Example 1:

        int num_chars = 0, num_lines = 0;  /* two globals, one counting characters
                                              and one counting lines; note that this
                                              statement must not start in column 1 */
    %%
    \n    { ++num_chars; ++num_lines; }
    .     ++num_chars;
    %%
    int main()
    {
        yylex();
        printf("%d,%d", num_chars, num_lines);
    }
    int yywrap()   /* end-of-file handler, called when yylex() reads EOF;
                      the user must provide it, or the build fails */
    {
        return 1;  /* 1 means the scan is finished; no further files follow */
    }

A lex input file is divided into three sections separated by %%.
Since a rules section must be present, the first %% is always required.

Patterns: LEX patterns are machine-readable regular expressions.

    Notation     Matches
    .            any character except newline
    \n           newline
    *            zero or more repetitions of the preceding expression
    +            one or more repetitions of the preceding expression
    ?            zero or one occurrence of the preceding expression
    ^            beginning of line
    $            end of line
    a|b          a or b
    (ab)+        one or more repetitions of ab
    "a+b"        the literal string a+b (C escape characters remain effective)
    []           a character class
    [^ab]        any character except a or b
    [a^b]        one of a, ^, b
    [a-z]        any character from a to z
    [a\-z]       one of a, -, z
    [az]         a or z
    <EOF>        the end-of-file marker

    Table 1: simple pattern matching

Inside square brackets ([]), the usual operators lose their meaning.
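A sketch exercising several of the Table 1 operators in one rule set (the patterns and messages are illustrative, not from the original notes):

    %%
    ^#.*$        printf("comment line\n");    /* ^ anchor, . class, $ anchor */
    [0-9]+       printf("integer\n");         /* character class and + */
    -?[0-9]+     printf("signed integer\n");  /* ? makes the sign optional */
    "a+b"        printf("literal a+b\n");     /* quoted literal, + not an operator */
    [^ \t\n]+    printf("word\n");            /* complemented class */
    [ \t\n]      ;
    %%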
A Quick Start with Yacc and Lex
Introducing Lex and Yacc

Lex and Yacc are two very important and powerful UNIX tools. In fact, if you have mastered Lex and Yacc, their power makes building a compiler for FORTRAN or C almost child's play. Ashish Bansal discusses in detail these two tools for writing your own language and compiler, covering regular expressions, declarations, match patterns, variables, Yacc grammars, and parser code, and finally explains how to combine Lex and Yacc.
Lex stands for Lexical Analyzer. Yacc stands for Yet Another Compiler Compiler. Let us begin with Lex.
Lex

Lex is a tool for generating scanners (lexical analyzers). A scanner is a program that recognizes lexical patterns in text. These lexical patterns (regular expressions) are defined in a particular syntax that we will discuss shortly. A matched regular expression may have an associated action, and that action may include returning a token.
When Lex receives input, as a file or as text, it tries to match the text against the regular expressions. It reads the input one character at a time until it finds a matching pattern. If it finds one, Lex executes the associated action (possibly including returning a token). If, on the other hand, no regular expression matches, further processing stops and Lex reports an error.
Lex and C are tightly coupled. A .lex file (Lex files carry the .lex extension) is passed through the lex utility, which produces C output files. These are then compiled into an executable version of the lexical analyzer. A sketch of this pipeline follows.
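As an example of that pipeline, consider a small word counter (the file name wc.l is hypothetical); the build commands shown in the comment are the conventional ones:

    /* Build and run:
     *     lex wc.l           (or: flex wc.l; both emit lex.yy.c)
     *     cc lex.yy.c -ll    (use -lfl when building with flex)
     *     ./a.out < input.txt
     */
    %{
    int words = 0;
    %}
    %%
    [a-zA-Z]+    words++;
    .|\n         ;
    %%
    int main() { yylex(); printf("%d words\n", words); return 0; }
    int yywrap() { return 1; }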
Lex regular expressions

A regular expression is a pattern description written in a meta-language. An expression is made up of symbols. Symbols are usually characters and digits, but Lex gives special meaning to a few other tokens. The two tables below define some of the tokens Lex uses and give typical examples.

Defining regular expressions in Lex:

    Character          Meaning
    A-Z, 0-9, a-z      characters and digits that form part of the pattern
    .                  matches any character except \n
Lex Documentation (Chinese Edition)
Lex - A Lexical Analyzer Generator

1. Introduction to FLEX

The description of a token is called a pattern (lexical pattern); patterns are normally described precisely with regular expressions. FLEX reads a text file in a prescribed format and outputs a C source program as described below. FLEX's input file is called the LEX source file; it contains regular expressions and the C code that handles the corresponding patterns. By convention the LEX source file carries the extension .l. By scanning the source file, FLEX automatically generates the lexical-analysis function int yylex() and writes it to a file whose name is fixed as lex.yy.c. In practice the file may be renamed, for example to lexyy.c. That file is LEX's output, i.e., the generated lexical analyzer. You can also add int yylex() to your own project and use it from there.
2. Patterns

LEX patterns (also called rules) are machine-readable regular expressions; regular expressions are built recursively from concatenation, union, and closure. For convenience, LEX adds several operators on top of these. Listed from highest to lowest precedence, the operators of LEX regular expressions are:

    "  \  [ ]  ^  -  ?  .  *  +  |  ( )  /  $  { }  %  < >

For the details of LEX pattern definitions, see Table 1.1.

3. LEX source file format

LEX is very strict about the format of its source files; for example, writing a statement that must begin in column 1 anywhere else causes a fatal error. LEX's own error reporting is weak, so take care when writing.
A LEX source file consists of three sections, separated by '%%' at the start of a line:

    definitions section
    %%
    rules section
    %%
    additional user C code

3.1 The definitions section

The definitions section consists of three parts: C code, macro definitions for patterns, and start-condition declarations for conditional patterns. The C code is bracketed by %{ and %} at the start of a line; when LEX scans the source file, it copies everything between %{ and %} verbatim into the output file lex.yy.c.

Table 1.1, LEX pattern (macro) definitions, pairs macro names with macro definitions. For example:

    DIGIT  [0-9]
    ID     [A-Za-z][A-Za-z0-9_]*

A macro name begins with a letter or the underscore '_' and consists of letters, digits, and underscores; names are case-sensitive. Macro names must start in column 1, and a macro name and its definition must appear on the same line.
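A sketch of how these macros are then used in the rules section, where {name} expands the corresponding definition (the actions here are illustrative):

    DIGIT  [0-9]
    ID     [A-Za-z][A-Za-z0-9_]*
    %%
    {DIGIT}+             printf("INT: %s\n", yytext);
    {DIGIT}+"."{DIGIT}*  printf("FLOAT: %s\n", yytext);
    {ID}                 printf("IDENT: %s\n", yytext);
    [ \t\n]              ;
    %%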
Section 2.6, Automatic Generation of Lexical Analyzers (from Compiler Principles and Practice, 3rd ed., Huang Xianying, Tsinghua University Press)
• flex++ - a lexical-analyzer generator for C++
• JFlex - a lexical/parser generator for Java
• ANTLR - ANother Tool for Language Recognition, a Java-based tool that accepts grammar descriptions and generates programs that recognize sentences of the described languages
Such names can be used to define new regular expressions. Let the alphabet be Σ. A regular definition is a sequence of definitions of the form:

    d1 → r1
    d2 → r2
    ...
    dn → rn

where each di (i = 1, 2, ..., n) is a distinct name, and each ri (i = 1, 2, ..., n) is a regular expression over Σ ∪ {d1, d2, ..., di-1}, that is, a regular expression written using the basic symbols and previously defined names.
• In Pascal, an identifier is a string that begins with a letter and continues with letters and digits; a regular definition for it is sketched below.
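Written in the d → r notation above, a plausible reconstruction of that definition (the names letter, digit, and id are conventional) is:

    letter → A | B | ... | Z | a | b | ... | z
    digit  → 0 | 1 | ... | 9
    id     → letter (letter | digit)*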
LEX, a Tool for the Automatic Generation of Lexical Analyzers
• Lex is a tool for generating lexical analyzers. It is widely used to produce lexical analyzers for many languages and is also known as the Lex compiler.
• A lexical analyzer is a program that recognizes token patterns in text. Token patterns (regular expressions) are defined in a special syntax known as the Lex language.
• Each regular expression is given a set of associated actions (possibly including returning a token).
2.6 Automatic Construction of Lexical Analyzers

The automatic construction of lexical analyzers rests on the equivalence of finite automata and regular expressions, namely:

1. For any NFA M over Σ, a regular expression R over Σ can be constructed such that L(R) = L(M).
2. For any regular expression R over Σ, an NFA M over Σ can be constructed such that L(M) = L(R).

Hence, given a regular expression, an equivalent finite automaton can always be found.
Generating a Lexical Analyzer Automatically with LEX
Abstract: Compiler Principles is an important specialized course for computer science and technology majors, particularly computer science students, at universities at home and abroad. The course systematically introduces the structure of a compiler, its workflow, and the design principles and implementation techniques of each of its components; it covers lexical analysis, syntax analysis, semantic analysis, and optimization. Lexical analysis is the first phase of compilation and its foundation. The task of this phase is to read the source program character by character from left to right, scanning the character stream and recognizing words (also called word symbols or tokens) according to the word-formation rules. The lexical analyzer performs this task.
A lexical analyzer can be generated automatically with tools such as Lex. This course project uses lex to generate a simple lexical analyzer for the C language. Keywords: lex, C language, lexical analyzer.

Introduction

A lexical analyzer analyzes the words in program text. It is the first phase of compilation, in which the source program is read character by character from left to right and tokens are recognized from the character stream according to the word-formation rules.
Lex is a widely used tool, invoked on UNIX systems with the lex command. It is used to build lexical analyzers for a wide variety of languages. Lexical analysis underlies all later analysis and optimization; it requires relatively little background, such as state-transition diagrams, and is easy to implement. The following sections discuss using lex to build a lexical analyzer for a simple C subset, to develop a deeper understanding of lexical analysis.
1. Overview

1.1 Goals. Design and implement a simple lexical analyzer for C. The aims are to master the basic theory of compilation, understand the structure of a compiler, master the theory and techniques of each compilation phase, strengthen the ability to write and debug high-level-language programs, master the concepts and implementation of lexical analysis, and become familiar with the various C tokens.

1.2 Scope. Use lex to design and implement a simple lexical analyzer for C, test it on C source code, and report the results as two-tuples.

2. Design rationale

The lexical analyzer is constructed with Parser Generator. A Lex input file consists of three parts: definitions, rules, and auxiliary routines (user routines). A sketch of such a file follows.
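A minimal sketch of such a specification for a tiny C subset, printing each token as a two-tuple as described in 1.2 (the category codes and the token set are hypothetical):

    %{
    #define KW   256
    #define ID   257
    #define NUM  258
    #define OP   259
    %}
    %%
    "int"|"if"|"else"|"while"|"return"   printf("(%d, %s)\n", KW, yytext);
    [A-Za-z_][A-Za-z0-9_]*               printf("(%d, %s)\n", ID, yytext);
    [0-9]+                               printf("(%d, %s)\n", NUM, yytext);
    [-+*/=<>]                            printf("(%d, %s)\n", OP, yytext);
    [ \t\n]                              ;
    .                                    printf("(unknown, %s)\n", yytext);
    %%
    int main() { yylex(); return 0; }
    int yywrap() { return 1; }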
Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt

ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

              +-------+
    Source -> |  Lex  | -> yylex
              +-------+

              +-------+
    Input  -> | yylex | -> Output
              +-------+

        An overview of Lex
             Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

    %%
    [ \t]+$    ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$    ;
    [ \t]+     printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2.
Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

             lexical        grammar
              rules          rules
                |              |
                v              v
           +---------+    +---------+
           |   Lex   |    |  Yacc   |
           +---------+    +---------+
                |              |
                v              v
           +---------+    +---------+
  Input -> |  yylex  | -> | yyparse | -> Parsed input
           +---------+    +---------+

            Lex with Yacc
               Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. . . Such backup is more costly than the processing of simpler languages.

2. Lex Source.

The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

    %%

(no definitions, no rules) which translates into a program which copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions (see section 3) and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear

    integer    printf("found keyword INT");

to look for the string integer in the input stream and print the message ``found keyword INT'' whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

    colour     printf("color");
    mechanise  printf("mechanize");
    petrol     printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be described later.
3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression

    integer

matches the string integer wherever it appears and the expression

    a57D

looks for the string a57D.

Operators. The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus

    xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression

    "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in

    xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,

    [a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it is desired to include the character - in a character class, it should be first or last; thus

    [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

    [^abc]

matches all characters except a, b, or c, including all special or control characters; or

    [^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes within character class brackets.

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline.
Escaping into octal is possible although non-portable:

    [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

    ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

    a*

is any number of consecutive a characters, including zero; while

    a+

is one or more instances of a. For example,

    [a-z]+

is all strings of lower case letters. And

    [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

    (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level;

    ab|cd

would have sufficed. Parentheses can be used for more complex expressions:

    (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators. If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline). The latter operator is a special case of the / operator character, which indicates trailing context. The expression

    ab/cd

matches the string ab, but only if followed by cd. Thus

    ab$

is the same as

    ab/\n

Left context is handled in Lex by start conditions as explained in section 10. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by

    <x>

using the angle bracket operator characters. If we considered ``being at the beginning of a line'' to be start condition ONE, then the ^ operator would be equivalent to

    <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example

    {digit}

looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. In contrast,

    a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source segments.

4. Lex Actions.

When an expression written as above is matched, Lex executes the corresponding action. This section describes some features of Lex which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted.
Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

    [ \t\n]    ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

    " "     |
    "\t"    |
    "\n"    ;

with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

    [a-z]+    printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case, the format is ``print string'' (% indicating data conversion, and s indicating string type), and the data are the characters in yytext. So this just places the matched string on the output. This action is so common that it may be written as ECHO:

    [a-z]+    ECHO;

is the same as the above. Since the default action is just to print the characters found, one might ask why give a rule, like this one, which merely specifies the default action? Such rules are often required to avoid matching some other rule which is not desired. For example, if there is a rule which matches read it will normally match the instances of read contained in bread or readjust; to avoid this, a rule of the form

    [a-z]+

is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a count yyleng of the number of characters matched. To count both the number of words and the number of characters in words in the input, the user might write

    [a-zA-Z]+    {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words recognized. The last character in the string matched can be accessed by

    yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of lookahead offered by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

    \"[^"]*    {
            if (yytext[yyleng-1] == '\\')
                yymore();
            else
                ... normal user processing
            }

which will, when faced with a string such as "abc\"def" first match the five characters "abc\; then the call to yymore() will cause the next part of the string, "def, to be tacked on the end. Note that the final quote terminating the string should be picked up in the code labeled ``normal processing''.

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print a message. A rule might be

    =-[a-zA-Z]    {
            printf("Op (=-) ambiguous\n");
            yyless(yyleng-1);
            ... action for =- ...
            }

which prints a message, returns the letter after the operator to the input stream, and treats the operator as ``=-''. Alternatively it might be desired to treat this as ``= -a''. To do this, just return the minus sign as well as the letter to the input:

    =-[a-zA-Z]    {
            printf("Op (=-) ambiguous\n");
            yyless(yyleng-2);
            ... action for = ...
            }

will perform the other interpretation. Note that the expressions for the two cases might more easily be written

    =-/[A-Za-z]

in the first case and

    =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to recognize the whole identifier to observe the ambiguity. The possibility of ``=-3'', however, makes

    =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.
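For example, a private yywrap might arrange for the scanner to continue from a new file and return 0. A sketch, assuming the usual lex/flex convention that yyin is the scanner's current input stream (the file list here is hypothetical):

    #include <stdio.h>
    extern FILE *yyin;                     /* the scanner's current input stream */
    static const char *files[] = { "second.txt", "third.txt", 0 };
    static int next = 0;

    int yywrap(void)
    {
        if (files[next] == 0)
            return 1;                      /* no more input: do the normal wrapup */
        yyin = fopen(files[next++], "r");
        if (yyin == 0)
            return 1;                      /* could not open: give up */
        return 0;                          /* tell Lex to continue scanning */
    }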
5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

    integer    keyword action ...;
    [a-z]+     identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer and so the identifier interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example, '.*' might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

    'first' quoted string here, 'second' here

the above expression will match

    'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

    '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line. Don't try to defeat this with expressions like (.|\n)+ or equivalents; the Lex generated program will try to read the entire input file, causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

    she    s++;
    he     h++;
    \n     |
    .      ;

where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

    she    {s++; REJECT;}
    he     {h++; REJECT;}
    \n     |
    .      ;

these rules are one way of changing the previous example to do just that. After counting each expression, it is rejected; whenever appropriate, the other expression will then be counted. In this example, of course, the user could note that she includes he but not vice versa, and omit the REJECT action on he; in other cases, however, it would not be possible a priori to tell which input characters were in both classes.

Consider the two rules

    a[bc]+    { ... ; REJECT;}
    a[cd]+    { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other.
Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram to be incremented, the appropriate source is

    %%
    [a-z][a-z]    {
            digram[yytext[0]][yytext[1]]++;
            REJECT;
            }
    .     ;
    \n    ;

where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

    {definitions}
    %%
    {rules}
    %%