# Project results

During the project semester, we studied the behavior and performance of the database by benchmarking queries against it, and we exploited the multi-zone model in which missing data were imputed using a Random Forest algorithm and then smoothed.

## 1. Random query of nodes

The comparative analysis of the nodes carried out earlier in the project covered only the first 30 nodes. This was corrected in the benchmarking script by querying all the nodes in the database (71 nodes) in random order. The script takes an execution count, i.e. the number of times the command is run before its execution time is evaluated, and shuffles the node order for each execution.
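The per-execution shuffling described above can be sketched as follows. This is a minimal illustration, not the actual benchmarking script: `benchmark_nodes` and `run_query` are hypothetical names introduced here.

```python
import random
import time

def benchmark_nodes(node_ids, run_query, executions=5):
    """Time `run_query` over all nodes, shuffling the node order
    on each execution so that ordering effects average out."""
    durations = []
    for _ in range(executions):
        order = node_ids[:]        # copy, so the original list stays intact
        random.shuffle(order)      # different node order for each execution
        start = time.perf_counter()
        for node in order:
            run_query(node)
        durations.append(time.perf_counter() - start)
    # report the mean over the requested number of executions
    return sum(durations) / len(durations)
```

A call such as `benchmark_nodes(list(range(71)), my_query, executions=5)` would then mirror the setup used in the benchmark (71 nodes, 5 executions).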

Here is an example of a command that, when executed, exports the results to a CSV file named time-elm-freq.csv and produces the figure below. The figure shows the total execution time of the query plus the imputation and smoothing algorithms, as a function of the number of nodes requested, for a sampling period of 60 min, over all 71 nodes, over a period of one year (from January 1, 2019 at 00:00:00 to December 31, 2019 at 23:59:59), and with 5 executions (the command is run 5 times, with a different node order each time).

```
python3 benchmarking.py --in 60 --v --o node --n 71 --p 5 --sd 2019-01-01T00:00:00Z --ed 2019-12-31T23:59:59Z --data_algo smooth
```

Option | Description |
---|---|
`--in` | sampling period in minutes |
`--v` | print verbose |
`--o` | option choice (node or zone) |
`--n` | number of nodes or zones to request |
`--p` | the number of times each query is repeated before its execution time is evaluated |
`--sd` | start date |
`--ed` | end date |
`--data_algo` | choose between data algorithms (impute or smooth) |

The parameters of the linear regression (slope and intercept) were also displayed on the plot in order to assess the linearity of the benchmark curve.

From this graph, we notice that the total execution time (query, imputation and smoothing algorithms) is a positive, increasing function of the number of nodes requested. Indeed, the database takes longer to respond when more nodes are queried.

It is then clear that the benchmark curve is essentially linear: it closely follows the regression line \(y = 236.09 x + 948.67\), where \(y\) is the total execution time and \(x\) the number of nodes requested.
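A fit of this kind can be reproduced with an ordinary least-squares regression. As a sketch, the arrays below use the first five total execution times from the results table (converted to seconds); a fit over this subset will not exactly match the quoted line, which was computed over the full set of measurements.

```python
import numpy as np

# x: number of nodes requested
# y: total execution time in seconds (first five table rows)
x = np.array([4.0, 12.0, 20.0, 28.0, 36.0])
y = np.array([1170.59, 3148.40, 5248.46, 7217.84, 9067.84])

# degree-1 polynomial fit: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f} x + {intercept:.2f}")
```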

In order to detail the results obtained, the execution time of each step was studied separately (taking the average over the 5 executions):

Number of nodes | Query execution time | Imputation execution time | Smoothing execution time | Total execution time |
---|---|---|---|---|
4 | 03min54s | 14min47.41s | 41.81s | 19min30.59s |
12 | 05min46.42s | 44min26.84s | 02min15.15s | 52min28.40s |
20 | 10min32.82s | 01h13min20.32s | 03min35.31s | 01h27min28.46s |
28 | 14min39.06s | 01h40min33.87s | 05min04.91s | 02h00min17.84s |
36 | 15min09.35s | 02h09min37.86s | 06min37.47s | 02h31min07.84s |
44 | 16min30.09s | 02h37min21.86s | 08min07.93s | 03h01min59.88s |
52 | 20min50.22s | 03h05min41.42s | 09min12.26s | 03h35min59.76s |
60 | 23min28.20s | 03h37min13.42s | 10min53.11s | 04h11min34.74s |
68 | 28min52.78s | 04h06min39.23s | 12min23.46s | 04h47min55.48s |
71 | 28min23.13s | 04h13min54.76s | 12min41.05s | 04h54min35.52s |

We can clearly see that as the number of nodes increases, the execution time of each step (query, imputation and smoothing) increases linearly.

The table above also shows that the data imputation time dominates the other times. From these results we computed the average of each time and produced the following pie chart, in which the imputation time represents 86.1% of the total execution time, while the query represents only 9.5% and the smoothing 4.37%:
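These shares can be recomputed from the table by converting the `XXhYYminZZ.ZZs` durations into seconds. The helper `to_seconds` below is written for this illustration, and the three average times are placeholder values, not the exact averages behind the pie chart:

```python
import re

def to_seconds(duration: str) -> float:
    """Convert a duration like '01h13min20.32s' (hours and minutes
    optional) into seconds."""
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)min)?(?:([\d.]+)s)?", duration)
    h, mn, s = (float(g) if g else 0.0 for g in m.groups())
    return 3600 * h + 60 * mn + s

# average step times (placeholder values for illustration)
steps = {
    "query": to_seconds("16min48.61s"),
    "imputation": to_seconds("02h22min45.90s"),
    "smoothing": to_seconds("07min10.25s"),
}
total = sum(steps.values())
shares = {name: 100 * t / total for name, t in steps.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The resulting percentages can then be fed directly to a pie-chart routine.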

## 2. Test imputation algorithm

The data collected by the sensors contain missing values. To complete them, we implemented an imputation algorithm based on the Random Forest algorithm, using a time window of 10 samples.
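The windowed Random Forest idea can be sketched as follows: predict each missing sample from the 10 observations preceding it. This is an illustrative sketch under assumed details (feature construction, forest size, forward imputation), not the project's exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

WINDOW = 10  # number of preceding samples used as features

def impute_series(values: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Fill NaNs in a 1-D series with a Random Forest trained to
    predict a sample from the `window` samples before it."""
    filled = values.copy()
    # training pairs: complete windows followed by an observed value
    X, y = [], []
    for i in range(window, len(values)):
        past = values[i - window:i]
        if not np.isnan(past).any() and not np.isnan(values[i]):
            X.append(past)
            y.append(values[i])
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(np.array(X), np.array(y))
    # impute forward, reusing freshly imputed samples in later windows
    for i in range(window, len(filled)):
        if np.isnan(filled[i]):
            filled[i] = model.predict(filled[i - window:i].reshape(1, -1))[0]
    return filled
```

On a periodic signal with a short gap, such a model fills the hole by extrapolating from the surrounding windows.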

In order to verify the efficiency and performance of the imputation algorithm, we implemented a test script in which we created an artificial data hole in a CSV file named export_test_impute by deleting a time margin \(\Delta(t)\) starting from a fixed start date, and then filled it with the imputation algorithm.
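Creating the artificial hole can be sketched with pandas: mask every sample within \(\Delta(t)\) of a fixed start date. The function name and the `value` column are illustrative assumptions, not the test script's actual API.

```python
import pandas as pd

def make_hole(df: pd.DataFrame, start: str, delta: str,
              column: str = "value") -> pd.DataFrame:
    """Mask `column` to NaN for timestamps in [start, start + delta),
    keeping the original frame intact for later comparison."""
    holed = df.copy()
    t0 = pd.Timestamp(start)
    mask = (holed.index >= t0) & (holed.index < t0 + pd.Timedelta(delta))
    holed.loc[mask, column] = float("nan")
    return holed

# example: an hourly series with a 3-hour hole
idx = pd.date_range("2019-01-01", periods=24, freq="h")
df = pd.DataFrame({"value": range(24)}, index=idx)
holed = make_hole(df, "2019-01-01 06:00", "3h")
print(holed["value"].isna().sum())  # → 3
```

Keeping the untouched frame around makes it easy to compare the imputed values against the deleted originals afterwards.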

The results obtained by varying \(\Delta(t)\) are presented below:

Examining these results, we observe that as \(\Delta(t)\) increases, the difference between the actual and the imputed values also increases.

To confirm this observation, we calculated the MSE (mean squared error) between the real and the inferred values for each \(\Delta(t)\).
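The MSE computation itself is straightforward; a minimal sketch, with a made-up pair of series for illustration:

```python
import numpy as np

def mse(real, imputed) -> float:
    """Mean squared error between the deleted values and the values
    the imputation algorithm filled in."""
    real = np.asarray(real, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.mean((real - imputed) ** 2))

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # → 0.1666...
```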

The following histogram represents the MSE value for each \(\Delta(t)\):

Analyzing the above histogram, we reach the following conclusion: there is a direct relationship between \(\Delta(t)\) and the MSE. As \(\Delta(t)\) increases, the MSE increases as well, which confirms the previous observation.