FOLLOW-UP TO THE FOREX DATA ANALYSIS EXPERIMENT: exploratory analysis
The earlier parts are at the following links:
https://www.mql5.com/ru/blogs/post/659572
https://www.mql5.com/ru/blogs/post/659929
Important: last time I deleted several rows of the dataset in which a value spanned different pairs (an effect of gluing together the histories of 5 pairs). But I did it only for the covariates. Today, analyzing the charts, I saw that the look-ahead (future) variables have the same problem. Therefore, the previous code should be supplemented with the following lines:
dat_train_final <- subset(dat_train_final, dat_train_final$future_lag_724 < 0.05 & dat_train_final$future_lag_724 > -0.05)
dat_test_final <- subset(dat_test_final, dat_test_final$future_lag_724 < 0.05 & dat_test_final$future_lag_724 > -0.05)
Because the rows in the training and validation sets are spaced 724 ± 50 steps apart, it is enough to remove outliers using only the 724-step increment variable.
Today: exploratory analysis, graphical study, and a statistical check of data heterogeneity between the training and validation samples.
Today's code:
### loading working data
dat_train_final <- read.csv('C:/R_study/fx/dat_train_final.csv'
                            , sep = ','
                            , dec = '.')
dat_test_final <- read.csv('C:/R_study/fx/dat_test_final.csv'
                           , sep = ','
                           , dec = '.')
### charts
library(ggplot2)
# one title per block of 18 columns: 6 input blocks and 1 output block
chart_names <- c('price lags', 'mean differences', 'max differences', 'min differences',
                 'standard deviation', 'price range', 'future price lags')
# Time series
for (i in 0:6) {
  cols <- (1:18) + i * 18          # the 18 lag columns of block i
  n <- nrow(dat_train_final)
  # stack the 18 columns into long format: one row per (observation, lag)
  plot_data <- data.frame(
    values = unlist(dat_train_final[, cols], use.names = FALSE),
    lags = rep(names(dat_train_final)[cols], each = n),
    points = rep(seq_len(n), times = length(cols))
  )
  ch <- ggplot(plot_data, aes(x = points, y = values, colour = lags, group = lags)) +
    geom_line(alpha = 0.3) +
    ggtitle(chart_names[i + 1])
  print(ch)
}
# density
for (i in 0:6) {
  cols <- (1:18) + i * 18          # the 18 lag columns of block i
  # stack the 18 columns into long format: one row per (observation, lag)
  plot_data <- data.frame(
    values = unlist(dat_train_final[, cols], use.names = FALSE),
    lags = rep(names(dat_train_final)[cols], each = nrow(dat_train_final))
  )
  ch <- ggplot(plot_data, aes(x = values, fill = lags)) +
    geom_density(alpha = 0.3) +
    ggtitle(paste('density of', chart_names[i + 1]))
  print(ch)
}
# boxplot
for (i in 0:6) {
  cols <- (1:18) + i * 18          # the 18 lag columns of block i
  plot_data <- data.frame(
    values = unlist(dat_train_final[, cols], use.names = FALSE),
    # factor levels follow the column order, so the lags stay sorted on the x axis
    lags = factor(rep(names(dat_train_final)[cols], each = nrow(dat_train_final)),
                  levels = names(dat_train_final)[cols])
  )
  ch <- ggplot(plot_data, aes(x = lags, y = values, fill = lags)) +
    geom_boxplot(show.legend = FALSE) +   # newer ggplot2 uses show.legend instead of show_guide
    xlab('lags') +
    ylab('differences') +
    ggtitle(paste('boxplot of', chart_names[i + 1]))
  print(ch)
}
# Density comparison for train and test
distr_compare <- data.frame()
plot_list <- list()
for (i in 1:126) {
  plot_data <- as.data.frame(c(dat_train_final[, i]
                               , dat_test_final[, i]))
  plot_data$group <- c(rep(paste0('train_', names(dat_train_final)[i]), nrow(dat_train_final))
                       , rep(paste0('test_', names(dat_test_final)[i]), nrow(dat_test_final)))
  colnames(plot_data) <- c('values', 'lags')
  # small zero-mean noise breaks ties before the two-sample Kolmogorov-Smirnov test
  pval <- round(ks.test(dat_train_final[, i] + rnorm(n = nrow(dat_train_final), mean = 0, sd = 0.00001)
                        , dat_test_final[, i] + rnorm(n = nrow(dat_test_final), mean = 0, sd = 0.00001)
                        , alternative = 'two.sided')$p.value, digits = 4)
  distr_compare[i, 1] <- names(dat_train_final)[i]
  distr_compare[i, 2] <- pval
  plot_list[[i]] <- ggplot(plot_data, aes(x = values, fill = lags)) +
    geom_density(alpha = 0.3) +
    ggtitle(paste('density of train and test for variable', names(dat_train_final)[i], 'with distribution equality p-value =', pval))
}
colnames(distr_compare) <- c('variable_name', 'ks.tests p-value')
plot(distr_compare$`ks.tests p-value`, type = 's')
plot_list[[1]]
Let's go through the details. The ggplot2 package must be installed.
Let me remind you that we formed an input vector of 108 variables and an output of 18 variables.
All inputs are divided into 6 groups according to how they were built:
1) price increments taken at different lags
2) the difference between the last price and the moving average
3) the difference between the last price and the moving maximum
4) the difference between the last price and the moving minimum
5) the moving standard deviation
6) the moving range
The outputs are the differences between the price n steps ahead and the last known price.
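The six input groups and the output can be sketched for a single lag as follows. This is only an illustration on a synthetic price series, not the code from the earlier posts: the series, the variable names, and the choice of window (the last k prices including the current one) are assumptions.

```r
set.seed(1)
price <- 1.1 + cumsum(rnorm(300, sd = 0.0005))  # synthetic price series (assumption)

k <- 8      # one of the lags
t <- 200    # index of the last known price
window <- price[(t - k + 1):t]  # last k prices, current one included (assumption)

features <- c(
  lag_diff  = price[t] - price[t - k],    # 1) price increment over k steps
  mean_diff = price[t] - mean(window),    # 2) last price minus moving average
  max_diff  = price[t] - max(window),     # 3) last price minus moving maximum
  min_diff  = price[t] - min(window),     # 4) last price minus moving minimum
  sd        = sd(window),                 # 5) moving standard deviation
  range     = max(window) - min(window),  # 6) moving range
  future    = price[t + k] - price[t]     # output: price k steps ahead minus last price
)
print(features)
```

With this window definition the difference from the maximum can never be positive and the difference from the minimum can never be negative, which is worth keeping in mind when reading the boxplots.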
Steps (lags): 2, 3, 4, 6, 8, 11, 16, 23, 32, 45, 64, 91, 128, 181, 256, 362, 512, 724.
Let me remind you that this is 2 ^ x for x from 1 to 9.5 with step = 0.5, rounded to the nearest integer.
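As a quick check, the list of lags can be reproduced in one line. The variable name is an assumption made for this sketch.

```r
# Powers of 2 with exponents from 1 to 9.5 in steps of 0.5, rounded to integers
lags <- round(2 ^ seq(1, 9.5, by = 0.5))
print(lags)
```

The result has 18 values, one per column in each block of variables.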
Also recall that for training we took the first 2/3 of the history of each currency pair (i.e. the most distant past) and left the remaining 1/3 of each pair for validation. So the model will be checked on values from the future of the price series.
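A chronological split of this kind might look like the sketch below. The helper split_chronological and the toy data frame are hypothetical; the actual split was done in the earlier posts.

```r
# Hypothetical helper: first 2/3 of the rows (oldest) for training,
# last 1/3 (newest) for validation. Rows must already be in time order.
split_chronological <- function(df) {
  n_train <- (nrow(df) * 2) %/% 3
  list(train = head(df, n_train),
       test  = tail(df, nrow(df) - n_train))
}

# Toy data for one currency pair (assumption)
toy <- data.frame(time = 1:9, price = 1.1 + (1:9) * 0.0001)
parts <- split_chronological(toy)
print(parts$train$time)
print(parts$test$time)
```

No shuffling is involved, so every validation row lies strictly after every training row in time.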
First, let's look at a very informative but bulky summary of our data.
summary(dat_train_final)
   lag_diff_2          lag_diff_3          lag_diff_4          lag_diff_6          lag_diff_8          lag_diff_11         lag_diff_16
Min.   :-1.164e-02  Min.   :-1.176e-02  Min.   :-1.416e-02  Min.   :-1.517e-02  Min.   :-1.491e-02  Min.   :-1.545e-02  Min.   :-1.493e-02
1st Qu.:-2.000e-04  1st Qu.:-2.000e-04  1st Qu.:-2.000e-04  1st Qu.:-3.000e-04  1st Qu.:-3.000e-04  1st Qu.:-4.000e-04  1st Qu.:-4.000e-04
Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00
Mean   :-2.165e-06  Mean   :-2.373e-06  Mean   :-5.480e-07  Mean   :-3.982e-06  Mean   :-4.110e-06  Mean   : 7.570e-07  Mean   : 3.493e-06
3rd Qu.: 2.000e-04  3rd Qu.: 2.000e-04  3rd Qu.: 2.000e-04  3rd Qu.: 3.000e-04  3rd Qu.: 3.000e-04  3rd Qu.: 4.000e-04  3rd Qu.: 4.000e-04
Max.   : 8.900e-03  Max.   : 8.700e-03  Max.   : 8.900e-03  Max.   : 9.100e-03  Max.   : 9.500e-03  Max.   : 9.500e-03  Max.   : 1.030e-02
   lag_diff_23         lag_diff_32         lag_diff_45         lag_diff_64         lag_diff_91         lag_diff_128        lag_diff_181
Min.   :-1.382e-02  Min.   :-1.420e-02  Min.   :-1.710e-02  Min.   :-1.99e-02   Min.   :-2.020e-02  Min.   :-0.024000   Min.   :-2.850e-02
1st Qu.:-5.000e-04  1st Qu.:-6.000e-04  1st Qu.:-7.000e-04  1st Qu.:-8.20e-04   1st Qu.:-1.000e-03  1st Qu.:-0.001200   1st Qu.:-1.400e-03
Median : 0.000e+00  Median : 0.000e+00  Median : 0.000e+00  Median : 0.00e+00   Median : 0.000e+00  Median : 0.000000   Median : 0.000e+00
Mean   : 5.935e-06  Mean   : 5.528e-06  Mean   :-2.476e-06  Mean   :-2.15e-07   Mean   : 1.225e-05  Mean   : 0.000008   Mean   : 1.832e-05
3rd Qu.: 5.000e-04  3rd Qu.: 6.000e-04  3rd Qu.: 7.000e-04  3rd Qu.: 8.00e-04   3rd Qu.: 1.010e-03  3rd Qu.: 0.001200   3rd Qu.: 1.500e-03
Max.   : 1.470e-02  Max.   : 1.560e-02  Max.   : 2.430e-02  Max.   : 2.60e-02   Max.   : 3.290e-02  Max.   : 0.034920   Max.   : 3.588e-02
   lag_diff_256        lag_diff_362        lag_diff_512        lag_diff_724   # etc.
The summary function gives insight into the distribution of each variable and is always helpful.
We now turn to the graphical analysis.
First, for each block of variables I build a line chart, so that each chart shows the 18 variables corresponding to the lags.
Here, for example, are the price differences (price[t] - price[t - k]).
And now let's look at the probability density of these variables.
Let's come at it from another side and look at the boxplot (box-and-whisker plot) for the same variables.
Here is something interesting: the median (the bar inside the box) is greater than zero. This is the comparison of the last price with the moving minimum. Note also that the median grows as the lag increases. That is, the deeper the window over which we take the minimum, the larger we expect the difference from that minimum to be.
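The growth of the median with the lag follows from how the variable is built: a minimum over a longer window can only be lower or equal, so the difference from it can only be larger or equal. A small simulation on a synthetic random walk shows this. The series, the helper min_diff and the three lags chosen are assumptions for this sketch.

```r
set.seed(7)
price <- 1.1 + cumsum(rnorm(2000, sd = 0.0005))  # synthetic random walk (assumption)

# Difference between the current price and the minimum of the last k prices
# (the window includes the current price, so the result is never negative)
min_diff <- function(price, k, idx) {
  sapply(idx, function(t) price[t] - min(price[(t - k + 1):t]))
}

lags <- c(8, 64, 512)
idx <- max(lags):length(price)  # the same time points for every lag
medians <- sapply(lags, function(k) median(min_diff(price, k, idx)))
names(medians) <- paste0("min_diff_", lags)
print(medians)
```

Because the same time points are used for every lag, the differences are ordered point by point, and so are their medians.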
And this is a concise view of the standard deviation values at different lags.
Now we will compare the data in the training set with the validation sample.
In this compact way we iterate over all 126 columns of the datasets and compare the densities graphically, while running a test for equality of distributions (the null hypothesis is that they are equal).
I use the two-sample Kolmogorov-Smirnov goodness-of-fit test: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
Note that when I run the test, I add normal noise to the data (mean = 0, sd = 0.00001) in order to remove tied values in both samples, since ties distort the estimate of the test statistic; the noise has zero mean, so it does not itself introduce a systematic shift (bias) into the data.
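The effect of the noise can be seen in a minimal sketch on synthetic data. The two vectors stand in for one column of the training and validation sets and are not the real data; rounding them to 4 decimals creates ties similar to those in price differences.

```r
set.seed(42)
# Synthetic stand-ins for one column of each sample (assumption)
train_col <- round(rnorm(500, mean = 0, sd = 0.001), 4)
test_col  <- round(rnorm(500, mean = 0, sd = 0.002), 4)

# Zero-mean noise, far smaller than the data scale, only breaks the ties
jitter_sd <- 0.00001
train_j <- train_col + rnorm(length(train_col), mean = 0, sd = jitter_sd)
test_j  <- test_col  + rnorm(length(test_col),  mean = 0, sd = jitter_sd)

res <- ks.test(train_j, test_j, alternative = "two.sided")
print(res$statistic)
print(res$p.value)
```

Here the two samples were generated with different spreads on purpose, so a small p-value is the expected outcome; without the noise, ks.test would warn about ties.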
We find that for all variables the (cumulative) distributions in the training and validation samples differ considerably. The p-values are very close to zero, so we reject the null hypothesis of equal distributions. And this is bad news, because these differences make everything time-dependent :)
Let's draw a few of the density plots.
That is all. We saw that Forex data changes over time, and this makes it difficult to build a model that will behave as we expect on new data. We also simply looked at our data from different angles.
Next time: machine learning and actually predicting price movements.
See you. Stay tuned for updates.