What is lexical analysis?

I was taking a look at the source code of a well-known PHP library called Twig (it is a template engine with its own syntax), and I came across classes, interfaces and methods such as Lexical, Lex and LexInterface.

I did some research and realized that the term in question was lexical analysis.

Although I understood some of it, other parts left me confused.

For example, I am used to seeing the terms parser or parsing when it comes to transforming one kind of data into another.

  • What would lexical analysis be?

  • Are lexical analysis and parsing/parser the same thing, or are they actually different things?

Sorry if the question is confusing, but I think the community will help me with a good and enlightening answer.

Author: Wallace Maxters, 2016-02-16

6 answers

Definition

Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that divides it into lexemes and converts them into a sequence of tokens, which are then used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner or tokenizer.

Tokens / lexemes

Lexemes are syntactic units relevant in the lexer's context; for example, for a lexer aimed at a certain programming language, some lexemes could be: 1, "Hello", for, ==, variableName, function.

A token is a structure that categorizes a lexeme: it contains an abstract name that represents the lexeme's group and, in case the group is not unique, a possible value. Example of some tokens (each token is represented in the format < name, value > and "#" marks the start of a comment):

< digit, 1 > # any numeric value
< literal, "olá" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters representing a variable
< function > # the letters f,u,n,c,t,i,o,n

Difference from parsing

Lexers and parsers are closely related yet distinct concepts. The lexer specializes in extracting the relevant portions of the text and turning them into structures with "meaning", in this case tokens. The parser has the function of analyzing the syntactic structure of a text, for example telling whether, in a given programming language, the expression "olá" 1 == "outroliteral" is syntactically valid or invalid. The link between the two is that the structures produced by the lexer are what the parser uses as its source to perform the parsing; that is, the parser works with tokens. This is not mandatory: nothing prevents you from building a parser that works directly on top of plain text, but separating the two tasks has some advantages:

  1. Design simplification. A parser that also performs the work of a lexer is significantly more complex.
  2. Optimization. By separating the two tasks (lexing and parsing), you have more freedom to apply optimization techniques specific to each one.

A lexer in practice

Theory is essential, yet nothing beats real code to aid understanding. Here I show a handmade lexer in JavaScript that handles a subset of CSS; more specifically, it handles this example:

h1 {
    color: red;
    font-size: 30px;
}

body {
    background-color: yellow;
}

div {
    margin: 10px;
    padding: 10px;
}

The code can be executed and will show the list of tokens generated after processing our target CSS:

// our CSS
var css = "h1 { \n\
	color: red; \n\
	font-size: 30px; \n\
} \n\
 \n\
body { \n\
	background-color: yellow; \n\
} \n\
 \n\
div { \n\
	margin: 10px; \n\
	padding: 10px; \n\
} \n\
";

/**
 * List defining our tokens and the respective regular expressions that match them in the text.
 */
var TOKENS = [
	{name: 'EMPTY', regex: /^(\s+)/ },
	{name: 'RULE_SET_START', regex: /^({)/ },
	{name: 'RULE_SET_END', regex: /^(})/ },
	{name: 'RULE_DEFINITION', regex: /^(:)/ },
	{name: 'RULE_END', regex: /^(;)/ },
	{name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
	{name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
	{name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
];


var text = css;
var outputTokenList = [];
while(text !== '') { // while there is text left to consume

	var hasMatch = false;
	
	/**
	 * We iterate over the whole TOKENS list until we find one whose pattern matches the beginning of our text.
	 * When a "match" occurs, we add the lexeme with its respective token to the lexer's output token list.
	 * If no pattern matches the text, an exception is thrown printing the line that contains the error.
	 */
	for (var i=0; i<TOKENS.length; i++) {
		var obj = TOKENS[i];
		var tokenName = obj.name;
		var tokenRegex = obj.regex;
		
		var matched = text.match(tokenRegex);
		if (!matched) {
			continue;
		}
		
		hasMatch = true;
		var lexeme = matched[1];
		
		// remove the matched lexeme from the text
		// so that another one can be considered
		// in the next iteration
		text = text.substring(lexeme.length);

		if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
			// add both the token name and the lexeme
			outputTokenList.push([tokenName, lexeme]);
		}
		else if (tokenName in {'EMPTY': 1}) {
			// discard: we don't care about spaces and line breaks.
		}
		else {
			// in these cases the relationship between the token name
			// and the lexeme is 1<->1, so we don't need to store the lexeme.
			outputTokenList.push([tokenName]);			
		}
		
		break;
	}

	if (!hasMatch) {
		throw 'Invalid pattern at:\n' + text.substring(0, text.indexOf('\n'));
	}
}

outputTokenList.forEach(function(token) {
	document.write('< ' + token[0]);
    if (token.length > 1) {
	    document.write(' , ' + token[1]);
    }
	document.write(' ><br>');
});
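For the CSS above, the first tokens printed should look like this (the rest of the list follows the same pattern):

< SELECTOR , h1 >
< RULE_SET_START >
< RULE , color >
< RULE_DEFINITION >
< RULE_VALUE , red >
< RULE_END >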

I will not go into explanations of lexer implementations, because that is an extensive subject and not directly related to the question. Note that this is just an illustration of how a lexer may work: it reads a target text and generates tokens as output. The code is neither efficient nor robust.

 29
Author: BrunoRB, 2016-02-22 16:47:51

What would lexical analysis be?

It is the analysis of a sequence of characters to create symbols that make it easier to understand what is written there.

Are lexical analysis and parsing/parser the same thing, or are they actually different things?

They are complementary.

The first step is the generation of symbols (tokens), called lexical analysis, in which the input character stream is divided into meaningful symbols defined by a grammar of regular expressions.

The second step is parsing, which checks that the tokens form an allowed expression.

The final step is semantic analysis, which works out the implications of the validated expression and takes the appropriate action.
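To make these three steps concrete, here is a minimal sketch in JavaScript; the token names, the tiny grammar and the evaluation step are invented here just to illustrate the pipeline:

// 1) lexical analysis: character stream -> tokens
function tokenize(input) {
	var patterns = [
		{name: 'NUMBER', regex: /^\d+/},
		{name: 'PLUS',   regex: /^\+/},
		{name: 'SPACE',  regex: /^\s+/}
	];
	var tokens = [];
	while (input !== '') {
		var matched = false;
		for (var i = 0; i < patterns.length; i++) {
			var m = input.match(patterns[i].regex);
			if (!m) continue;
			if (patterns[i].name !== 'SPACE') { // spaces are discarded
				tokens.push({name: patterns[i].name, value: m[0]});
			}
			input = input.substring(m[0].length);
			matched = true;
			break;
		}
		if (!matched) throw 'Lexical error at: ' + input;
	}
	return tokens;
}

// 2) parsing: checks that the tokens form the allowed expression NUMBER PLUS NUMBER
function parse(tokens) {
	var expected = ['NUMBER', 'PLUS', 'NUMBER'];
	if (tokens.length !== expected.length) throw 'Syntax error';
	for (var i = 0; i < expected.length; i++) {
		if (tokens[i].name !== expected[i]) throw 'Syntax error';
	}
	return tokens;
}

// 3) semantic analysis / action: act on the validated expression
function evaluate(tokens) {
	return Number(tokens[0].value) + Number(tokens[2].value);
}

console.log(evaluate(parse(tokenize('1 + 2')))); // 3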

 11
Author: Rodrigo Guiotti, 2016-02-19 11:32:13

Lexical analysis and syntactic analysis (parsing) are in fact very similar things, considering that the algorithms that work on these two fronts operate in a similar way: the input processed by both is similar, and so are the results they present.

As mentioned in the comments on the question, this expression indeed comes up a lot when studying automata, precisely because lexical analyzers deal only with regular languages. In practice, a lexical analyzer acts as a token recognizer: given the grammar of any regular language, a lexical analyzer is able to determine whether a given sequence of characters is part of that language or not. Hence the famous regular expressions.

Parsers, on the other hand, work at a more complex level of languages because they deal with context-free languages. Also in practice, parsers usually process a sequence of tokens (generally recognized by lexers) and determine whether those tokens satisfy the structures defined in the grammar of the language.

Therefore, lexical analysis consists of validating tokens according to the formation rules of a regular language; if the language is no longer regular, it is a job for a parser. In the case of Twig, since it is a template engine, I believe that lexical analysis occurs in the recognition of special markers such as {{, else, {% etc.
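Just for illustration, here is a hypothetical sketch in JavaScript of how such markers could be recognized with regular expressions (the token names are invented; this is not Twig's actual lexer):

// hypothetical token definitions for Twig-style markers
var markers = [
	{name: 'VAR_START',   regex: /^\{\{/},
	{name: 'VAR_END',     regex: /^\}\}/},
	{name: 'BLOCK_START', regex: /^\{%/},
	{name: 'BLOCK_END',   regex: /^%\}/}
];

// returns the name of the marker found at the start of the text, or null
function recognize(text) {
	for (var i = 0; i < markers.length; i++) {
		if (markers[i].regex.test(text)) {
			return markers[i].name;
		}
	}
	return null;
}

console.log(recognize('{{ name }}')); // VAR_START
console.log(recognize('{% else %}')); // BLOCK_START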


I am updating my answer because I believe it was not broad enough, and also because I think the other answers were either too generic or tied lexers too closely to parsers.

First the fundamental similarities between lexers and parsers:

  1. Both take as input symbols from some alphabet. Usually the symbols of a lexer are ASCII or Unicode characters, while parsers process tokens (the terminal symbols of the grammar they are validating).

  2. They analyze these symbols by matching them against a grammar in order to recognize them as members of a language. As I explained above, lexers only validate regular languages, while parsers operate on context-free languages: levels 3 and 2 in the Chomsky hierarchy, respectively.

  3. Both produce as output sentences of the language they are evaluating. Lexers distinguish tokens in a string given as input, while parsers generate syntax trees.

With respect to the last point, although the two are complementary in the vast majority of cases, this does not mean that a parser must always receive its tokenized input from a lexer. Since a parser is capable of generating the sentences of some language L1, we can define a language L2 whose tokens are sentences of L1. Thus, parsers can also act as tokenizers for other parsers.

In natural language processing, parsers get their tokenized input from a series of steps that may involve manual editing, statistical analysis, and machine learning.

Therefore, the fundamental difference between the two is the one I explained above: the level of language on which they operate and, consequently, the complexity required to operate at that level. While finite state automata are sufficient for regular languages, context-free languages require more complex mechanisms such as pushdown automata, plus the huge variety of existing parsers (bottom-up, top-down, tabular etc.).
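To make the difference concrete, here is a sketch in JavaScript of a finite state automaton that recognizes the regular language of identifiers (a letter followed by letters or digits); recognizing something like balanced nested braces would already require a pushdown automaton:

// states: 0 = start, 1 = accepting (inside an identifier)
function isIdentifier(input) {
	var state = 0;
	for (var i = 0; i < input.length; i++) {
		var c = input.charAt(i);
		if (state === 0) {
			if (!/[a-zA-Z]/.test(c)) return false; // first char must be a letter
			state = 1;
		} else {
			if (!/[a-zA-Z0-9]/.test(c)) return false; // then letters or digits
		}
	}
	return state === 1; // accept only if we reached the accepting state
}

console.log(isIdentifier('variableName')); // true
console.log(isIdentifier('1variable'));    // false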

 9
Author: Roney Gomes, 2016-02-21 23:32:54

An interpreter/compiler does not understand raw text directly; it needs the data to be well organized and defined.

For example, let's assume that our interpreter needs to interpret this simple sum:

42 + 42

How do we solve this? This is where the role of the lexical analyzer comes in.

The programmer who created our imaginary interpreter defined that a number is one digit followed by other digits, and that a sum is simply the symbol '+'.

[0-9]+ returns NUMBER;
"+"    returns PLUS;
" "    ;

What now? Well, let's analyze what happens as the lexer scans the input, with the dot marking the character currently being analyzed:

1) . 42 (our lexer puts the digit 4 in the buffer; it knows that, by definition, a number is one digit followed by other digits.)

2) 4 . 2 (so far, it has 4 as the number and continues, storing 2 in the buffer.)

3) 42 . (this is the end of the number, so our lexer returns NUMBER. We then find a blank space, which by definition returns nothing.)

For now, we know this:

<NUMBER, 42> . + 42

The lexer is on the '+'; it knows that by definition it is a PLUS:

<NUMBER, 42> <PLUS, '+'> . 42

And the same process applies to the second 42:

<NUMBER, 42> <PLUS, '+'> <NUMBER, 42>

The lexical analysis is complete; what now? Our interpreter can now work with this data consistently. Let's assume that our interpreter uses a very simple and restrictive grammar for sums:

sum -> NUMBER PLUS NUMBER

The parser can use the output tokens of the lexical analyzer and focus only on the syntax. Since the output consists of NUMBER PLUS NUMBER, it matches a valid sum.
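As a minimal sketch in JavaScript, assuming the token list from the walkthrough above (the names and structure are illustrative):

// tokens produced by the lexical analyzer for "42 + 42"
var tokens = [['NUMBER', '42'], ['PLUS', '+'], ['NUMBER', '42']];

// grammar rule: sum -> NUMBER PLUS NUMBER
function isValidSum(tokens) {
	return tokens.length === 3 &&
		tokens[0][0] === 'NUMBER' &&
		tokens[1][0] === 'PLUS' &&
		tokens[2][0] === 'NUMBER';
}

console.log(isValidSum(tokens)); // true: "42 + 42" is a valid sum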

 6
Author: Victor Martins, 2016-02-24 17:09:10

The answers presented are very good and have already covered the entire technical side of the subject. So, just to complement the information, here is an explanation focused only on the etymology of the words.


Lexical Analysis

Refers to the vocabulary (words) of the language. In simple terms, it is a dictionary analysis: it verifies the existence of terms/words within the language.
For example, "Carnaval", "ovo", "bola" are part of the lexicon of the Portuguese language, while "party", "egg", "ball" are part of the lexicon of the English language. Lexical analysis is not concerned with the order or meaning of the terms, but only with the terms themselves.

You can see a technical example here.

Parsing

Refers to the grammatical rules of the language, that is, to how we can organize the terms of the language to create meaning. It can be seen as the way a command must be structured to perform an action, or the rules for forming a sentence.

Technical example here.

Semantic Analysis

Here we are talking about the sense/meaning employed in the sentence/command. For example, we can use the phrase Me inclui fora dessa ("include me out of this"), where the words and the syntax are correct, but semantically it is not.

Technical example here.


These definitions are part of linguistics as a whole and are used as a way to organize and facilitate the understanding of how a piece of code/process proposes to work.

I chose not to use a technical approach and to give examples from the Portuguese language instead, as they are equally valid when you think of programming languages, and they make understanding the terms simpler.

 5
Author: Bruno Oliveira, 2016-02-23 19:18:43

Everything has been said; however, I like language processors and I can't resist putting an example here, in this case using Lex+Yacc.

Problem statement: given a (simplified) CSS file, taking as an example the case presented by @BrunoRP, calculate how many properties each tag has.

Translation grammar: grammar + semantic actions

Syntactic analysis = parser (yacc)

%{
#include <stdio.h>
%}

%union {char *s; int i;}
%token <s> ID STR MEA
%type  <i> props

%%
css  :                                      // bloco*
     | css bloco
     ;
bloco: ID '{' props '}'  { printf("%s - %d\n",$1,$3);}
     ;
props:                   { $$ = 0    ;}     // prop*
     | props prop        { $$ = $1+1 ;}     // count the props
     ;
prop : ID ':' v ';'      ;                  // property
v    : ID | STR | MEA    ;                  // value
%%
#include "lex.yy.c"

yyerror(char *s){fprintf(stderr,"ERROR (line %d):'%s'\n-%s",yylineno, yytext,s);}

int main() { yyparse(); return 0; }

Lexical analysis (flex)

%option yylineno
%option noyywrap

%%
[ \n\t]+              {  } // ignore whitespace
#.*                   {  } // ignore comments

[{}:;]                { return yytext[0];} // special characters

[a-zA-Z][-a-zA-Z0-9]* { yylval.s=strdup(yytext);  return ID;  }
\"[^\"]*\"            { yylval.s=strdup(yytext);  return STR; }
[0-9]+(px|pt|mm)      { yylval.s=strdup(yytext);  return MEA; }

.  {fprintf(stderr,"%d: Error - invalid character '%c'\n",yylineno,yytext[0]);}

%%

What the lexical analysis does:

  1. skip spaces and comments
  2. return the codes of reserved words and special characters
  3. group the characters that form IDs etc.; return their code and value
  4. deal with lexical errors

Compiling and testing:

$ flex proc.l
$ yacc proc.y
$ cc -o proc y.tab.c
$ ./proc < ex.css    ## ex.css is the CSS example from @BrunoRP

h1 - 2
body - 1
div - 2
 5
Author: JJoao, 2016-02-26 12:57:31