This is the first project language theory.

ETC: 15 hours (deadline - 5 classes)

2-3 students per team

Please take your time on both quality and contents

Associate professor and assistant professors will not answer questions about the project.

Scale: 40 points

  1. 15 Points
  2. 10 Points
  3. 15 points

project Language Theory machine learning exercise

Language theory project: Automata language theory project

“-Why is it so difficult for you to accept my orders if you're just a machine? -Just a machine? That's like saying that you are just an ape. ”

Part 1: Readable and understandable data

Jacq Vaucan: Funny, you were supposed to help us survive.

Blue robot: Surviving is not relevant. Living is. We want to live. "

Language theory project: Automata language theory project

After seeing a robot repairing itself, an agent shoots it in the head, rendering its biokernel seemingly unreadable and unusable. Jacq Vaucan, an insurance agent working for the ROC, is in charge of investigating the origin of this faulty workmanship, which leads him to discover a series of anomalies and strange behavior on the part of the automatons.

During his investigation, he became interested in a robot welder, who set himself on fire in front of his own eyes, thus breaking the second protocol. Having related this fact, Jacq is not believed by his peers, and nevertheless continues his search for a “Watchmaker”, a person supposed to be capable of a technical feat such as the theoretically impossible bypassing of protocols.

Having become suspicious of the ROC, Jacq decides to go see Doctor Dupré, a brilliant researcher in robotics who manages to make the damaged biokernel speak and to copy it on the biokernel of a Cleo, then able to repair itself and to learn extremely fast. The Biokernel is capable of both building new protocols and producing protocols resulting from a combination of its knowledge.

Doctor Dupré decides to give Jacq the process of creating new algorithms in the biokernel. The first step is recovery and data cleaning. For this, Doctor Dupré gives Jacq a series of values showing the energy activity of the biokernel:

Jacq then decides to call on a specialist (you) in order to better understand how, from a series of data, the biokernel manages to derive a predictive and autonomous AI.

Your first mission is to clean this data, for that you use the following standard protocol:

  • For every day
    • Set the start of day value in MWh
    • For each hour of the day, calculate the slope of the change in consumption (at 0.01%)

Each value of the slope forms a symbol of your alphabet, each day forms a word. Thus, the comparison of words will allow the machine to better understand the evolution of its consumption over time.

Since the detail of the evolution of consumption is quite precise, this will generate a very large number of symbols in the alphabet, you must then clean the alphabet in order to limit the noise. You have at most 24 symbols for each day, if you fabricate the words over a month, that will generate more than 700 symbols. You therefore have the idea of grouping similar data within the same referent value thanks to the k-Means algorithm:


  1. Give the words spawned over a week (5 points)
  2. Explain the k-Means algorithm using the generated words (5 points)
  3. Show the result obtained (via R) over a month of consumption (5 points)

Part 2: Decision trees and simplification

"Cleo: Now I know why the rain changed.

Jacq Vaucan: Why?

Cleo: I don't think you could understand. "

Language theory project: Automata language theory project

Now that you have cleaned up the data, you need to build the learning control automaton. The first stage of the construction of the automaton is done thanks to a facility of simplification of the biokernel. The latter is able to identify identical patterns in a set of words. For example if over two days you get the following words: azertyuiop and bhjtyuiqf, the biokernel and able to say that only "tyui" is common to both words.

First, build the decision tree on the following words:

  • azertyuiop; ghjtyuifg; fghjktyui; fghjazeop

The decision tree has one branch per word, and one transition (arc) through each symbol in the word.

The second step is to find the common substrings for each pair of words:  exercise 4

Determine the largest common substrings between any pair of words. Sort the common substrings in ascending order and choose the largest ones as long as possible. It is possible to choose a common substrings if all the elements of this substrings are not already selected in another common substrings.

Once you have selected the substrings, group the states corresponding to two of the words together with the common word on the arc.


  1. Tree decision (2 points)
  2. Calculation of common substrings (5 points)
  3. Merged decision tree (3 points)

Part 3: Automaton reduction and prediction

Jacq Vaucan: Who altered your protocols?

Blue robot: Nobody altered my protocols.

Jacq Vaucan: What about them?

Blue robot: I enhanced them.

Jacq Vaucan: Are you the boss?

Blue robot: Boss is a human thought structure. "

Language theory project: Automata language theory project

Now that the automaton has been reduced, the last step is to determine and minimize it. Let's take the following automaton for convenience:

Language theory project: Automata language theory project

Determine and minimize the automaton in order to present the lightest possible algorithm to the biokernel. Once the minimal automaton has been obtained, give a decision tree on the future behavior (the continuation of the word) by considering the following prefix on a step of 5:

  • ab, what are the validated words?

Likewise, the biokernel is able to recover the distance traveled by going back to the PLC. Give a decision tree on the past behavior (the beginning of the word) by considering the following suffix on a step of 5:

  • ccc, what are the validated words?


  1. Determinization (5 points)
  2. Minimization (5 points)
  3. Prediction (5 points)