Formal language

Languages, grammar and automaton:

How to study Automaton:

Methods on Automaton:


  • LR
  • LL
  • SLR
  • LALR

In language theory, the set of elementary entities is called the alphabet. A combination of elementary entities is called a word. A set of words is called a language and is described by a grammar. From a grammar, we can build an effective procedure (called automaton) to decide if a word is part of the language.

There are different classes of languages, corresponding to different classes of grammars and automata. Among the simplest to study, the class of regular languages corresponds to regular grammars and finite automata. This grammar class is typically used to describe home automation without cycles.

A more global class than regular languages is the class of out-of-context languages, corresponding to context-free grammars and psuhdown automata. This grammar class, more powerful than the class of regular grammars, is typically used for home automation with known cycles.

The theory of languages is useful to:

  • describe an automatic behavior (robotics, home automation, building automation)
  • understand all of a language and its uses (compiler, DNA, machine learning)
  • minimize an automatic process (integrated software, integrated map for aerospace or automotive, etc.)


An alphabet, noted A, is a non-empty finite set of symbols (letters, signs, numbers, etc.)

A word, defined on an alphabet A, is a finite sequence of symbols / elements of A.

Some properties on words:

  • The length of a word u defined on an alphabet A, written |u| is the number of symbols (his cardinal) that make up u.
  • |u|j counts the number of occurrences of the symbol j in the word u.
  • Le mot vide, noté ε, est défini sur tous les alphabets et est le mot de longueur 0.
  • L’ensemble de tous les mots formés à partir de l’alphabet A (resp. de tous les mots non vides) est noté A* (resp. A+).
  • La concaténation de deux mots u et v, notée, u.v ou uv, est le mot formé en faisant suivre les symboles de u par les symboles de v. Une puissance n d’un mot est ce mot concaténé n fois.
  • si un ou des symboles est à la puissance + alors cela signifie qu’il existe un ou plus symbole de ce type dans le mot (ex. (ab)+=ab…)
  • si un ou des symboles est à la puissance * alors cela signifie qu’il existe zéro ou plus symbole de ce type dans le mot


Let two words u and v be defined on an alphabet A:

  • u is a prefix of v if and only if there exists a (potentially empty) word w such that uw = v.
  • u is a suffix of v if and only if there exists a (potentially empty) word w such that wu = v.
  • u is a factor of v if and only if there are two words (potentially empty) w1 and w2 such that w1uw2=v.
  • a non-empty word u is said primitive if the equation u=vi hasn’t solution for i>1.
  • two words x = uv and y = vu deduced from each other by exchange of prefix and suffix are said conjugated.


A language, defined on an alphabet A, is a set of words defined on A. A language is a subset of A*.

Set operations defined on languages:
  • The union contains all the words that are contained in L1 or in L2.
  • The intersection is the language that contains all the words contained in both L1 and L2.
  • The complement of a language is the language containing all the words that are not in that language.
  • Difference between two languages is the language containing all the words of L1 that are not in L2.

In addition to the set operations on the languages, the product or concatenation of two languages and defined by the language containing all the words formed by the concatenation of a word of L1 following by a word of L2. The product of languages is associative but not commutative.

The power of a language is a successive concatenation of this language Ln=L.Ln-1. Power 0 forms the compound language of the empty word.

Let L1 = {00, 11} and L2 = {0, 1, 01} be two languages on {0,1}.
L1.L2 = {000, 001, 0001, 110, 111, 1101}.


A language can be described by a number of rules: the grammar indicates how to construct sentences belonging to the language (functioning in production); the grammar also makes it possible to decide whether a given sentence belongs to the language or not (recognition operation).

A grammar is a quadruplet G = (T, N, S, R) as:

  • T is the terminal vocabulary, that is to say the alphabet on which the language is defined.
  • N is the non-terminal vocabulary, that is, the set of symbols that do not appear in the generated words, but are used during the generation. A non-terminal symbol designates a syntactic category.
  • R is a set of rules called rewriting or production of the form: u1 → u2, with u1 ∈ (N ∪ T)+ and u2 ∈ (N ∪ )*. Remember that + means at least one element, and * means 0 or more. The intuitive meaning of these rules is that the non-empty sequence of terminal or non-terminal symbols u1 can be replaced eventually with the eventual empty of terminal or non-terminal symbols u2.
  • S ∈ N is the starting symbol or axiom. It is from this non-terminal symbol that we will begin the generation of words by means of the rules of grammar.

When several grammar rules have the same left-hand shape, one
can factor these different rules by separating the straight parts by vertical lines. For example, the set of rules S → ab, S → aSb, S → c can be written S → ab | aSb | c.

The language defined, or generated, by a grammar is the set of words that can be obtained from the starting symbol by applying the rules of the grammar. More formally, we introduce the notions of derivation between forms, this consists in replacing a non-terminal symbol by a sequence of terminal and non-terminal symbols according to the grammar rules.

For more information on language-grammar operations, please refer to the links table at the top of this course.