# Project results

In the project semester, we studied the behavior and performance of the database by benchmarking requests against it, and exercised the multi-zone model in which missing data are imputed and then smoothed using a Random Forest.

## 1. Random query of nodes

The comparative analysis of the nodes carried out earlier in the project covered only the first 30 nodes. This was corrected in the benchmarking script by querying all the nodes in the database (71 nodes) in random order. The user supplies an execution count, i.e. the number of times the command is run before its execution time is evaluated, and the benchmark shuffles the node order for each execution.
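The shuffled timing loop can be sketched as follows; `benchmark_nodes` and `query_fn` are illustrative names, not the script's actual internals:

```python
import random
import time

def benchmark_nodes(node_ids, runs, query_fn):
    """Time `query_fn` over all nodes, shuffling the node order on each run.

    `query_fn(node_id)` stands in for the actual database request; the
    function returns the elapsed time of each run, in seconds.
    """
    timings = []
    for _ in range(runs):
        order = list(node_ids)
        random.shuffle(order)  # different node order for each execution
        start = time.perf_counter()
        for node_id in order:
            query_fn(node_id)
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical usage: 71 nodes, 5 runs, dummy query
times = benchmark_nodes(range(71), runs=5, query_fn=lambda n: None)
```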

Here is an example of a command that, when executed, exports the results to a CSV file named time-elm-freq.csv and produces the figure below. The figure shows the total execution time of the query plus the imputation and smoothing algorithms, as a function of the number of nodes requested, with a sampling period of 60 min, for all the nodes (71 nodes), over a period of one year (from January 1, 2019 at 00:00:00 to December 31, 2019 at 23:59:59), and with 5 executions (the command is run 5 times, with a different node order each time).

```
python3 benchmarking.py
    --in 60                    (1)
    --v                        (2)
    --o node                   (3)
    --n 71                     (4)
    --p 5                      (5)
    --sd 2019-01-01T00:00:00Z  (6)
    --ed 2019-12-31T23:59:59Z  (7)
    --data_algo smooth         (8)
```

1. sampling period in minutes
2. print verbose output
3. option choice (node or zone)
4. number of nodes or zones to request
5. number of times each query is repeated before its execution time is evaluated
6. start date
7. end date
8. choice of data algorithm (impute or smooth)

The parameters of the linear regression (slope and y-intercept) were also shown on the plot in order to assess the linearity of the benchmark curve.
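The least-squares fit itself is straightforward to reproduce; a sketch using the per-node totals of Table 1 converted to seconds (the coefficients on the plot were fitted to the full benchmark data, so this only approximates them):

```python
import numpy as np

# Total execution times from Table 1, converted to seconds
nodes = np.array([4, 12, 20, 28, 36, 44, 52, 60, 68, 71])
total_s = np.array([1170.59, 3148.40, 5248.46, 7217.84, 9067.84,
                    10919.88, 12959.76, 15094.74, 17275.48, 17675.52])

# Degree-1 least-squares fit: y = slope * x + intercept
slope, intercept = np.polyfit(nodes, total_s, 1)
print(f"y = {slope:.2f} x + {intercept:.2f}")
```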

From this graph, we notice that the total execution time (query, imputation and smoothing algorithms) increases with the number of nodes requested. Indeed, the database response takes longer when more nodes are queried.

The benchmark curve is clearly linear, as can be seen by comparing it with the regression line $y = 236.09 x + 948.67$, where $y$ is the total execution time and $x$ the number of nodes requested.

In order to detail the results obtained, the different execution times were broken down, averaging over the 5 executions:

Table 1. Calculation of the different execution times

| Number of nodes | Query execution time | Imputation execution time | Smoothing execution time | Total execution time |
|---|---|---|---|---|
| 4 | 03min54s | 14min47.41s | 41.81s | 19min30.59s |
| 12 | 05min46.42s | 44min26.84s | 02min15.15s | 52min28.40s |
| 20 | 10min32.82s | 01h13min20.32s | 03min35.31s | 01h27min28.46s |
| 28 | 14min39.06s | 01h40min33.87s | 05min04.91s | 02h00min17.84s |
| 36 | 15min09.35s | 02h09min37.86s | 06min37.47s | 02h31min07.84s |
| 44 | 16min30.09s | 02h37min21.86s | 08min07.93s | 03h01min59.88s |
| 52 | 20min50.22s | 03h05min41.42s | 09min12.26s | 03h35min59.76s |
| 60 | 23min28.20s | 03h37min13.42s | 10min53.11s | 04h11min34.74s |
| 68 | 28min52.78s | 04h06min39.23s | 12min23.46s | 04h47min55.48s |
| 71 | 28min23.13s | 04h13min54.76s | 12min41.05s | 04h54min35.52s |
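To post-process results like these, the duration strings must first be converted to seconds; a small helper, assuming the hours/minutes/seconds format used in the table (the raw export uses a decimal comma):

```python
import re

def to_seconds(stamp):
    """Convert a duration such as '01h27min28,46s' or '41,81s' to seconds."""
    stamp = stamp.replace(",", ".")  # normalise the decimal comma
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)min)?(?:([\d.]+)s)?", stamp)
    h, mn, s = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mn * 60 + s

print(to_seconds("01h27min28,46s"))  # 5248.46
```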

The table confirms that, as the number of nodes increases, the execution time of each step (query, imputation and smoothing) grows linearly.

The table also shows that the data imputation time dominates the other times. The results obtained allowed us to compute the average of each time and produce the following pie chart, in which imputation represents 86.1% of the total execution time, while the query represents only 9.5% and the smoothing 4.37%:
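The shares can be re-derived from the column sums of Table 1; a sketch with the durations hard-coded in seconds (the pie chart was built from the per-run averages, so the printed percentages only approximate the reported ones):

```python
# Per-step execution times from Table 1, in seconds, one entry per row
query  = [234.00, 346.42, 632.82, 879.06, 909.35,
          990.09, 1250.22, 1408.20, 1732.78, 1703.13]
impute = [887.41, 2666.84, 4400.32, 6033.87, 7777.86,
          9441.86, 11141.42, 13033.42, 14799.23, 15234.76]
smooth = [41.81, 135.15, 215.31, 304.91, 397.47,
          487.93, 552.26, 653.11, 743.46, 761.05]

total = sum(query) + sum(impute) + sum(smooth)
for name, col in [("query", query), ("imputation", impute), ("smoothing", smooth)]:
    print(f"{name}: {100 * sum(col) / total:.1f}%")
```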

## 2. Testing the imputation algorithm

The data collected by the sensors contain missing values. To fill them, we implemented an imputation algorithm based on the Random Forest algorithm, using a sliding time window of size 10.
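The windowing mechanics can be sketched as below. To keep the sketch dependency-free, a window mean stands in for the trained model; in the project, a Random Forest regressor trained on (window → next value) pairs fills that role:

```python
WINDOW = 10  # number of preceding points used to predict a missing value

def predict(window):
    # Placeholder model: the window mean. The project uses a Random
    # Forest regressor here instead.
    return sum(window) / len(window)

def impute(values):
    """Fill None holes left to right, so imputed points can feed later windows."""
    values = list(values)
    for i in range(len(values)):
        if values[i] is None:
            window = [v for v in values[max(0, i - WINDOW):i] if v is not None]
            if window:
                values[i] = predict(window[-WINDOW:])
    return values

print(impute([1.0, 2.0, 3.0, None, 5.0]))  # the hole gets the mean of prior points
```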

In order to verify the efficiency and performance of the imputation algorithm, we implemented a test-impute script: it creates a data hole in a CSV file named export_test_impute by deleting a time margin $\Delta(t)$ starting from a fixed start date, and then fills the hole with the imputation algorithm.
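Punching the hole can be sketched as follows; the row schema (`time` and `value` keys) is an assumption, not the actual layout of export_test_impute:

```python
from datetime import datetime, timedelta

def make_hole(rows, start, delta):
    """Blank the value of every row whose timestamp falls in [start, start + delta)."""
    held_out = {}
    for row in rows:
        t = datetime.fromisoformat(row["time"])
        if start <= t < start + delta:
            held_out[row["time"]] = row["value"]  # keep the ground truth for the MSE
            row["value"] = None
    return held_out

# Hypothetical 6-hour series, with a 2-hour hole starting at 02:00
rows = [{"time": f"2019-01-01T0{h}:00:00", "value": float(h)} for h in range(6)]
truth = make_hole(rows, datetime(2019, 1, 1, 2), timedelta(hours=2))
print(sorted(truth.values()))  # [2.0, 3.0]
```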

The results obtained by varying $\Delta(t)$ are presented below:

Examining these results, we observe that as $\Delta(t)$ increases, the difference between the actual and the imputed values also increases.

To confirm this observation, we calculated the MSE (mean squared error) between the real and the inferred values for each $\Delta(t)$.
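The MSE over a hole follows directly from the held-out and imputed values; a minimal sketch:

```python
def mse(actual, imputed):
    """Mean squared error between the real and the imputed values of a hole."""
    return sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual)

print(mse([2.0, 3.0], [2.1, 2.8]))  # (0.1**2 + 0.2**2) / 2 ≈ 0.025
```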

The following histogram represents the MSE value for each $\Delta(t)$:

Analyzing the above histogram leads to the following conclusion: there is a direct relationship between $\Delta(t)$ and the MSE. As $\Delta(t)$ increases, the MSE increases as well, which confirms the previous observation.