# Project results

In the project semester, we studied the behavior and performance of the database by benchmarking requests against it, and exercised the multi-zone model in which missing data are imputed and then smoothed using a Random Forest.

## 1. Random query of nodes

The comparative analysis of the nodes carried out earlier in the project covered only the first 30 nodes. This was corrected in the benchmarking script by querying all the nodes in the database (71 nodes) in random order. The user supplies an execution count, i.e. the number of times the command is run before its execution time is evaluated, and the benchmark shuffles the node order for each execution.
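The shuffled timing loop can be sketched as follows; `benchmark_nodes` and `query_fn` are illustrative names, not the script's actual internals:

```python
import random
import time

def benchmark_nodes(node_ids, runs, query_fn):
    """Time `query_fn` over all nodes, shuffling the node order on each run.

    `query_fn(node_id)` stands in for the actual database request; the
    function returns the elapsed time of each run, in seconds.
    """
    timings = []
    for _ in range(runs):
        order = list(node_ids)
        random.shuffle(order)  # different node order for each execution
        start = time.perf_counter()
        for node_id in order:
            query_fn(node_id)
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical usage: 71 nodes, 5 runs, dummy query
times = benchmark_nodes(range(71), runs=5, query_fn=lambda n: None)
```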

Here is an example of a command that, when executed, exports the results to a CSV file named time-elm-freq.csv and produces the figure below. The figure shows the total execution time of the query plus the imputation and smoothing algorithms, as a function of the number of nodes requested, with a sampling period of 60 min, for all the nodes (71 nodes), over a period of one year (from January 1, 2019 at 00:00:00 to December 31, 2019 at 23:59:59), and with 5 executions (the command is run 5 times, with a different node order each time).

```
python3 benchmarking.py
    --in 60                    (1)
    --v                        (2)
    --o node                   (3)
    --n 71                     (4)
    --p 5                      (5)
    --sd 2019-01-01T00:00:00Z  (6)
    --ed 2019-12-31T23:59:59Z  (7)
    --data_algo smooth         (8)
```

1. sampling period in minutes
2. print verbose output
3. option choice (node or zone)
4. number of nodes or zones to request
5. number of times each query is repeated before its execution time is evaluated
6. start date
7. end date
8. choice of data algorithm (impute or smooth)

The parameters of the linear regression (slope and y-intercept) were also shown on the plot in order to assess the linearity of the benchmark curve.
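The least-squares fit itself is straightforward to reproduce; a sketch using the per-node totals of Table 1 converted to seconds (the coefficients on the plot were fitted to the full benchmark data, so this only approximates them):

```python
import numpy as np

# Total execution times from Table 1, converted to seconds
nodes = np.array([4, 12, 20, 28, 36, 44, 52, 60, 68, 71])
total_s = np.array([1170.59, 3148.40, 5248.46, 7217.84, 9067.84,
                    10919.88, 12959.76, 15094.74, 17275.48, 17675.52])

# Degree-1 least-squares fit: y = slope * x + intercept
slope, intercept = np.polyfit(nodes, total_s, 1)
print(f"y = {slope:.2f} x + {intercept:.2f}")
```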

From this graph, we notice that the total execution time (query, imputation and smoothing algorithms) increases with the number of nodes requested. Indeed, the database response takes longer when more nodes are queried.

The benchmark curve is clearly linear, as can be seen by comparing it with the regression line $y = 236.09 x + 948.67$, where $y$ is the total execution time and $x$ the number of nodes requested.

In order to detail the results obtained, the different execution times were broken down, averaging over the 5 executions:

Table 1. Calculation of the different execution times

| Number of nodes | Query execution time | Imputation execution time | Smoothing execution time | Total execution time |
|---|---|---|---|---|
| 4 | 03min54s | 14min47.41s | 41.81s | 19min30.59s |
| 12 | 05min46.42s | 44min26.84s | 02min15.15s | 52min28.40s |
| 20 | 10min32.82s | 01h13min20.32s | 03min35.31s | 01h27min28.46s |
| 28 | 14min39.06s | 01h40min33.87s | 05min04.91s | 02h00min17.84s |
| 36 | 15min09.35s | 02h09min37.86s | 06min37.47s | 02h31min07.84s |
| 44 | 16min30.09s | 02h37min21.86s | 08min07.93s | 03h01min59.88s |
| 52 | 20min50.22s | 03h05min41.42s | 09min12.26s | 03h35min59.76s |
| 60 | 23min28.20s | 03h37min13.42s | 10min53.11s | 04h11min34.74s |
| 68 | 28min52.78s | 04h06min39.23s | 12min23.46s | 04h47min55.48s |
| 71 | 28min23.13s | 04h13min54.76s | 12min41.05s | 04h54min35.52s |
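To post-process results like these, the duration strings must first be converted to seconds; a small helper, assuming the hours/minutes/seconds format used in the table (the raw export uses a decimal comma):

```python
import re

def to_seconds(stamp):
    """Convert a duration such as '01h27min28,46s' or '41,81s' to seconds."""
    stamp = stamp.replace(",", ".")  # normalise the decimal comma
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)min)?(?:([\d.]+)s)?", stamp)
    h, mn, s = (float(g) if g else 0.0 for g in m.groups())
    return h * 3600 + mn * 60 + s

print(to_seconds("01h27min28,46s"))  # 5248.46
```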

The table confirms that, as the number of nodes increases, the execution time of each step (query, imputation and smoothing) grows linearly.

The table also shows that the data imputation time dominates the other times. The results obtained allowed us to compute the average of each time and produce the following pie chart, in which imputation represents 86.1% of the total execution time, while the query represents only 9.5% and the smoothing 4.37%:
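The shares can be re-derived from the column sums of Table 1; a sketch with the durations hard-coded in seconds (the pie chart was built from the per-run averages, so the printed percentages only approximate the reported ones):

```python
# Per-step execution times from Table 1, in seconds, one entry per row
query  = [234.00, 346.42, 632.82, 879.06, 909.35,
          990.09, 1250.22, 1408.20, 1732.78, 1703.13]
impute = [887.41, 2666.84, 4400.32, 6033.87, 7777.86,
          9441.86, 11141.42, 13033.42, 14799.23, 15234.76]
smooth = [41.81, 135.15, 215.31, 304.91, 397.47,
          487.93, 552.26, 653.11, 743.46, 761.05]

total = sum(query) + sum(impute) + sum(smooth)
for name, col in [("query", query), ("imputation", impute), ("smoothing", smooth)]:
    print(f"{name}: {100 * sum(col) / total:.1f}%")
```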

## 2. Testing the imputation algorithm

The data collected by the sensors contain missing values. To fill them, we implemented an imputation algorithm based on the Random Forest algorithm, using a sliding time window of size 10.
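The windowing mechanics can be sketched as below. To keep the sketch dependency-free, a window mean stands in for the trained model; in the project, a Random Forest regressor trained on (window → next value) pairs fills that role:

```python
WINDOW = 10  # number of preceding points used to predict a missing value

def predict(window):
    # Placeholder model: the window mean. The project uses a Random
    # Forest regressor here instead.
    return sum(window) / len(window)

def impute(values):
    """Fill None holes left to right, so imputed points can feed later windows."""
    values = list(values)
    for i in range(len(values)):
        if values[i] is None:
            window = [v for v in values[max(0, i - WINDOW):i] if v is not None]
            if window:
                values[i] = predict(window[-WINDOW:])
    return values

print(impute([1.0, 2.0, 3.0, None, 5.0]))  # the hole gets the mean of prior points
```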

In order to verify the efficiency and performance of the imputation algorithm, we implemented a test-impute script: it creates a data hole in a CSV file named export_test_impute by deleting a time margin $\Delta(t)$ starting from a fixed start date, and then fills the hole with the imputation algorithm.
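Punching the hole can be sketched as follows; the row schema (`time` and `value` keys) is an assumption, not the actual layout of export_test_impute:

```python
from datetime import datetime, timedelta

def make_hole(rows, start, delta):
    """Blank the value of every row whose timestamp falls in [start, start + delta)."""
    held_out = {}
    for row in rows:
        t = datetime.fromisoformat(row["time"])
        if start <= t < start + delta:
            held_out[row["time"]] = row["value"]  # keep the ground truth for the MSE
            row["value"] = None
    return held_out

# Hypothetical 6-hour series, with a 2-hour hole starting at 02:00
rows = [{"time": f"2019-01-01T0{h}:00:00", "value": float(h)} for h in range(6)]
truth = make_hole(rows, datetime(2019, 1, 1, 2), timedelta(hours=2))
print(sorted(truth.values()))  # [2.0, 3.0]
```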

The results obtained by varying $\Delta(t)$ are presented below:

Examining these results, we observe that as $\Delta(t)$ increases, the difference between the actual and the imputed values also increases.

To confirm this observation, we calculated the MSE (mean squared error) between the real and the inferred values for each $\Delta(t)$.
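The MSE over a hole follows directly from the held-out and imputed values; a minimal sketch:

```python
def mse(actual, imputed):
    """Mean squared error between the real and the imputed values of a hole."""
    return sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual)

print(mse([2.0, 3.0], [2.1, 2.8]))  # (0.1**2 + 0.2**2) / 2 ≈ 0.025
```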

The following histogram represents the MSE value for each $\Delta(t)$:

Analyzing the above histogram leads to the following conclusion: there is a direct relationship between $\Delta(t)$ and the MSE. As $\Delta(t)$ increases, the MSE increases as well, which confirms the previous observation.