This is the antiCPy.trend_extrapolation package. It contains
the class CPSegmentFit (basic serial implementation),
the class BatchedCPSegmentFit (strongly parallelized version).
CPSegmentFit incorporates all attributes needed to implement the Bayesian non-parametric linear segment fit which takes into account possible change points (CPs). The basic procedure is described in [vdL14] [K14] and the nomenclature is chosen congruent to that. Each of the calculation steps is realized by a class method of ``CPSegmentFit``. You can follow the instructions of the cited papers to interpret the code. For example, the segment fit can be applied to drift slope estimate \(\hat{\zeta}(t) \equiv y(x)\) time series computed with the antiCPy.early_warnings module.
The simple serial implementation CPSegmentFit can be rather time consuming. A first improvement is to use its multiprocessing option, which computes each CP configuration in parallel with a predefined number of workers. Additionally, large amounts of CP configurations will without a doubt result in memory errors. The BatchedCPSegmentFit class solves these issues by parallel computation of batches of CP configurations, while each worker only constructs a suitable subset of configurations. This leads to a major computation time improvement and avoids memory issues for a complicated CP segment fit with an arbitrary number of CPs.
- class antiCPy.trend_extrapolation.batched_cp_segment_fit.BatchedCPSegmentFit(x_data, y_data, number_expected_changepoints, num_MC_cp_samples, batchsize=5000, predict_up_to=None, z_array_size=100, print_batch_info=True, efficient_memory_management=False)[source]
Bases: ``CPSegmentFit``

The ``BatchedCPSegmentFit`` is a child class of ``CPSegmentFit``. It can be used to calculate the change point configurations with the corresponding segment fit in a strongly parallelized, batch-wise manner to avoid memory errors and to speed up computation significantly in the case of a high amount of data and change point configurations.

Important

In any case, make sure that you use

import multiprocessing
...
if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

Windows uses the method 'spawn' by default. In general it depends on your system, so it might be better to always set the option before using a ``BatchedCPSegmentFit`` object. If you use a Linux distribution, the method to create new workers is usually 'fork'. This will copy some features of the main process. Amongst others, the needed ``lock`` to avoid race conditions might be copied and the new process will freeze. After longer runs this leads to all processes getting frozen and killed after some time. You end up with incomplete tasks, but without an error message.

- static init_batch_execute(memory_connectorI, memory_connectorII, memory_connectorIII, memory_connectorIV, memory_management, multiprocessing)[source]
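The guard above can be sketched as a minimal, self-contained script. The worker function below is a hypothetical stand-in for one batch computation; only the standard library is used:

```python
import multiprocessing


def work(batch_num):
    # Hypothetical stand-in for one batch computation.
    return batch_num * batch_num


if __name__ == '__main__':
    # 'spawn' starts every worker from a fresh interpreter, so no locks or
    # other state of the main process are inherited via fork.
    multiprocessing.set_start_method('spawn', force=True)
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(work, range(4))
    print(results)  # [0, 1, 4, 9]
```

Setting the start method inside the ``if __name__ == '__main__':`` guard is essential: with 'spawn', each worker re-imports the script, and the guard prevents the pool from being created recursively.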
Internal static method to initialize the workers of a multiprocessing pool.
- static execute_batch(batch_num, print_batch_info, exact_sum_control, config_output, prepare_fit, total_batches, x, d, n_cp, batchsize, prediction_horizon, z_array, MC_cp_configurations, n_MC_samples, lock, first_round, second_round)[source]
Working order for the subprocesses. Creates a ``CPSegmentFit`` object for the batch and calculates the corresponding CP pdfs.
- cp_scan(print_sum_control=False, integration_method='Riemann sum', print_batch_info=True, config_output=False, prepare_fit=False, multiprocessing=True, num_processes='half', print_CPU_count=False)[source]
Adapted from the ``CPSegmentFit.cp_scan(...)`` method for strong parallelization and batch structure.
- fit(sigma_multiples=3, print_progress=True, print_batch_info=False, integration_method='Riemann sum', config_output=False, print_sum_control=True, multiprocessing=True, num_processes='half', print_CPU_count=False)[source]
Computes the segmental linear fit of the time series data with integrated change point assumptions over the ``z_array``, which contains ``z_array_size`` equidistant data points in the range from the first entry of ``x`` up to the ``prediction_horizon``. The fit results and corresponding variances are saved in the attributes ``D_array`` and ``DELTA_D2_array``, respectively.

- Parameters:

sigma_multiples (float) – Specifies which multiple of standard deviations is chosen to determine the ``upper_uncertainty_bound`` and the ``lower_uncertainty_bound``. Default is 3.

integration_method (str) – Determines the integration method to compute the change point probability. Default is ``'Riemann sum'`` for numerical integration with rectangles. Alternatively, the ``'Simpson rule'`` can be chosen under the assumption of one change point. Sometimes the Simpson rule tends to be unstable. The method should be the same as the integration method used in ``calculate_marginal_cp_pdf(...)``.

print_batch_info (bool) – If ``True``, the ratio of computed to total batches is printed. Default is ``False``.

config_output (bool) – If ``True``, the CP configurations of the current batch are printed. Default is ``False``.

print_sum_control (bool) – If ``True``, it prints whether the exact or the approximate MC sum is computed. Default is ``True``.

multiprocessing (bool) – If ``True``, the batches are computed by ``num_processes`` workers in parallel. Default is ``True``.

num_processes (str or int) – Default is ``'half'``. If ``'half'``, almost half of the CPU kernels are used. If ``'all'``, all CPU kernels are used. If an integer number, the defined number of CPU kernels is used for multiprocessing.

print_CPU_count (bool) – If ``True``, the total number of available CPU kernels is printed. Default is ``False``.
- class antiCPy.trend_extrapolation.cp_segment_fit.CPSegmentFit(x_data, y_data, number_expected_changepoints, num_MC_cp_samples, predict_up_to=None, z_array_size=100)[source]
The ``CPSegmentFit`` class contains tools to perform a Bayesian segmental fit under the assumption of a certain number of change points.

- Parameters:
x_data (One-dimensional numpy array of floats) – Given data on the x-axis. Saved in attribute ``x``.

y_data (One-dimensional numpy array of floats) – Given data on the y-axis. Saved in attribute ``y``.

number_expected_changepoints (int) – Number of expected change points in the fit.

num_MC_cp_samples (int) – Maximum number of MC summands that shall be incorporated in order to extrapolate the fit. Saved in attribute ``n_MC_samples``.

n_MC_samples (int) – Attribute that contains the number of MC summands of the performed extrapolation of the fit. It is exact whenever the number of possible change point configurations is smaller than ``num_MC_cp_samples``.

cp_prior_pdf (One-dimensional numpy array of floats) – Attribute that contains the flat prior probability of the considered change point configurations.

num_cp_configs (int) – Attribute of the number of possible change point configurations.

exact_sum_control (bool) – If this attribute is ``True``, the exact sum over all possible change point configurations will be computed in order to extrapolate the fit. If it is ``False``, the given maximum number ``num_MC_cp_samples`` of summands is smaller than the number of all possible change point configurations and the sum is performed as an approximate sum over ``num_MC_cp_samples`` randomly chosen change point configurations.

predict_up_to (float) – Defines the x-horizon of the extrapolation of the fit. Default is ``None``, since it depends on the time scale of the given problem. It is saved in the attribute ``prediction_horizon``.

d (One-dimensional numpy array of floats) – Attribute that contains the given ``y_data``.

x (One-dimensional numpy array of floats) – Attribute that contains the given ``x_data``.

A_matrix (Three-dimensional (``num_MC_cp_samples``, ``x_data.size``, ``number_expected_changepoints + 2``) numpy array of floats) – Attribute that contains the coefficients of the linear segments for the considered change point configurations.

A_dim (One-dimensional numpy array of floats) – Contains the dimensions of the ``A_matrix``.

N (int) – Attribute that contains the data size of the input ``x_data`` and ``y_data``.

n_cp (int) – Attribute that contains the ``number_expected_changepoints``.

MC_cp_configurations (Two-dimensional (``num_MC_cp_samples``, ``number_expected_changepoints + 2``) numpy array of floats) – Attribute that contains all possible change point configurations under the given assumptions and amount of data.

f0 (Two-dimensional (``num_MC_cp_samples``, ``number_expected_changepoints + 2``) numpy array of floats) – Attribute that defines a matrix of mean design ordinates. Each row corresponds to a vector of a specific configuration of change point positions.

x_start (float) – Attribute that contains the start value of ``x_data`` / ``x``.

x_end (float) – Attribute that contains the end value of ``x_data`` / ``x``.

prediction_horizon (float) – Attribute in which the upper limit of the extrapolation x-horizon is saved.

Q_matrix (Three-dimensional (``num_MC_cp_samples``, ``number_expected_changepoints + 2``, ``number_expected_changepoints + 2``) numpy array of floats) – Attribute that contains the matrices \(Q = A^{T}A\) of the considered change point configurations.

Q_inverse (Three-dimensional (``num_MC_cp_samples``, ``number_expected_changepoints + 2``, ``number_expected_changepoints + 2``) numpy array of floats) – Attribute that contains the inverse ``Q_matrix`` of each considered change point configuration.

Res_E (One-dimensional (``num_MC_cp_samples``) numpy array of floats) – Attribute that contains the residues \(R(E) = d^T d - \sum_k (u_k^T d)^2\) of each possible change point configuration \(E\).

marginal_likelihood_pdf (One-dimensional (``num_MC_cp_samples``) numpy array of floats) – Attribute that contains the marginal likelihood of each change point configuration.

marginal_log_likelihood (One-dimensional (``num_MC_cp_samples``) numpy array of floats) – Attribute that contains the marginal natural logarithmic likelihood of each change point configuration.

marginal_cp_pdf (One-dimensional (``num_MC_cp_samples``) numpy array of floats) – Attribute that contains the normalized a posteriori probability of the computed change point configurations. The normalization is valid for the grid of ``x_data``.

prob_cp (One-dimensional (``num_MC_cp_samples``) numpy array of floats) – Attribute that contains the probability \(P(E \vert \underline{d}, \underline{x}, \mathcal{I})\) of a given change point configuration \(E\).

D_array (One-dimensional numpy array of floats) – Attribute that contains the fitted values in the interval from the beginning of the time series up to ``prediction_horizon``.

DELTA_D2_array (One-dimensional numpy array of floats) – Attribute that contains the variances of the fitted values in ``D_array``.

transition_time (float) – Attribute which contains the time at which the extrapolated function crosses zero.

upper_uncertainty_bound (float) – Attribute which contains the time at which the upper uncertainty boundary crosses zero.

lower_uncertainty_bound (float) – Attribute which contains the time at which the lower uncertainty boundary crosses zero.
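Whether the exact sum is feasible follows from the number of change point configurations. The decision described for ``exact_sum_control`` can be sketched as follows, assuming the configurations are the \(\binom{N-2}{n_{cp}}\) ways to place the change points on interior grid points (the exact counting rule inside the package may differ):

```python
from math import comb


def exact_sum_feasible(N, n_cp, num_MC_cp_samples):
    # Assumed counting rule: n_cp change points on the N - 2 interior grid points.
    num_cp_configs = comb(N - 2, n_cp)
    # Enumerate exactly only if the MC budget covers all configurations;
    # otherwise fall back to an approximate MC sum over random configurations.
    return num_cp_configs <= num_MC_cp_samples, num_cp_configs


print(exact_sum_feasible(50, 1, 10000))   # (True, 48)
print(exact_sum_feasible(500, 3, 10000))  # (False, 20460496)
```

The second call shows why the combinatorics explode for several change points on long series, which is the regime ``BatchedCPSegmentFit`` is designed for.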
- initialize_MC_cp_configurations(print_sum_control=False, config_output=False)[source]
Defines the array ``MC_cp_configurations`` of all possible change point configurations including start and end ``x`` if the exact sum is computed. Otherwise it creates an approximate set of random change point configurations based on the cited literature.

- Parameters:

print_sum_control (bool) – If ``True``, it prints whether the exact or the approximate MC sum is computed. Default is ``False``.

config_output (bool) – If ``True``, the possible change point configurations without start and end data point and the shape of the corresponding array are printed. Additionally, the ``MC_cp_configurations`` attribute and its shape are printed. The attribute includes the start and end values. Default is ``False``.
- initialize_A_matrices()[source]
Creates the A_matrices of the MC summands which correspond to possible change point configurations.
- Q_matrix_and_inverse_Q(save_Q_matrix=False)[source]
Computes the Q_matrices and their inverses for each MC summand which corresponds to a possible change point configuration.
- calculate_f0()[source]
Calculates ``f0`` as the mean \(f_0\) of the normal distribution that characterizes the probability density function of the ordinate vectors \(f\).
- calculate_residue()[source]
Computes ``Res_E``, the residue \(R(E)\) of each MC summand.
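The residue formula can be checked against an ordinary least-squares fit on toy data: with \(u_k\) an orthonormal basis of the column space of \(A\) (here obtained from a QR decomposition), \(R(E) = d^T d - \sum_k (u_k^T d)^2\) equals the residual sum of squares. A numpy sketch with toy matrices, not the package's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))   # toy design matrix of one CP configuration
d = rng.normal(size=20)        # toy data vector

# Orthonormal basis u_k of the column space of A via QR decomposition.
U, _ = np.linalg.qr(A)
residue = d @ d - np.sum((U.T @ d) ** 2)

# Compare with the residual sum of squares of the least-squares fit.
coeffs, rss, *_ = np.linalg.lstsq(A, d, rcond=None)
assert np.isclose(residue, rss[0])
```

The identity holds because projecting \(d\) onto the fit subspace removes exactly \(\sum_k (u_k^T d)^2\) from the squared norm of the data.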
- calculate_marginal_likelihood()[source]
Computes the ``marginal_log_likelihood`` as \(1/Z \, (R(E))^{(N-3)/2}\) and the corresponding ``marginal_likelihood`` of each considered change point configuration.
- calculate_marginal_cp_pdf(integration_method='Riemann sum')[source]
Calculates the marginal posterior ``marginal_cp_pdf`` of each possible configuration of change point positions and normalizes the resulting probability density function. The normalization constant is determined by integration of the resulting pdf.

- Parameters:

integration_method (str) – Determines the integration method to compute the normalization. Default is ``'Riemann sum'`` for performing numerical integration via a sum of rectangles with the sample width. Alternatively, the ``'Simpson rule'`` can be chosen in the case of one possible change point. Sometimes the Simpson rule tends to be unstable. The method should be the same as the integration method used in ``calculate_prob_cp(...)``.
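With the default method, the normalization amounts to dividing by a rectangle sum over the grid. A generic numpy sketch of Riemann-sum normalization (toy density, not package code):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 101)            # grid of candidate CP positions
pdf_unnorm = np.exp(-0.5 * (x - 5.0) ** 2)  # some unnormalized density

dx = x[1] - x[0]
Z = np.sum(pdf_unnorm) * dx                 # Riemann-sum normalization constant
pdf = pdf_unnorm / Z

print(np.sum(pdf) * dx)  # ~1.0
```

The Simpson rule replaces the rectangle weights with the alternating 4/2 weights of quadratic interpolation, which is why both the normalization and the subsequent probability computation must use the same method.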
- calculate_prob_cp(integration_method='Riemann sum')[source]
Calculates the probability ``prob_cp`` of each configuration of change point positions.

- Parameters:

integration_method (str) – Determines the integration method to compute the change point probability. Default is ``'Riemann sum'`` for numerical integration with rectangles. Alternatively, the ``'Simpson rule'`` can be chosen under the assumption of one change point. Sometimes the Simpson rule tends to be unstable. The method should be the same as the integration method used in ``calculate_marginal_cp_pdf(...)``.
- predict_D_at_z(z)[source]
- Parameters:

z (float) – The x-data point for which an extrapolated value ``D`` with variance ``DELTA_D2`` shall be calculated.

- Returns:

The extrapolated y-data point ``D`` and its variance ``DELTA_D2`` for a given x-data point ``z``.
- cp_scan(print_sum_control=False, integration_method='Riemann sum', config_output=False)[source]
Perform a change point scan on the dataset.
- Parameters:
print_sum_control (bool) – If ``True``, it prints whether the exact or the approximate MC sum is computed. Default is ``False``.

integration_method (str) – Determines the integration method to compute the change point probability. Default is ``'Riemann sum'`` for numerical integration with rectangles. Alternatively, the ``'Simpson rule'`` can be chosen under the assumption of one change point. Sometimes the Simpson rule tends to be unstable. The method should be the same as the integration method used in ``calculate_marginal_cp_pdf(...)``.
- fit(sigma_multiples=3, print_progress=True, integration_method='Riemann sum', config_output=False, print_sum_control=True)[source]
Computes the segmental linear fit of the time series data with integrated change point assumptions over the ``z_array``, which contains ``z_array_size`` equidistant data points in the range from the first entry of ``x`` up to the ``prediction_horizon``. The fit results and corresponding variances are saved in the attributes ``D_array`` and ``DELTA_D2_array``, respectively.

- Parameters:

sigma_multiples (float) – Specifies which multiple of standard deviations is chosen to determine the ``upper_uncertainty_bound`` and the ``lower_uncertainty_bound``. Default is 3.

print_progress (bool) – If ``True``, the currently predicted data count is printed and updated successively.

integration_method (str) – Determines the integration method to compute the change point probability. Default is ``'Riemann sum'`` for numerical integration with rectangles. Alternatively, the ``'Simpson rule'`` can be chosen under the assumption of one change point. Sometimes the Simpson rule tends to be unstable. The method should be the same as the integration method used in ``calculate_marginal_cp_pdf(...)``.
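How the fitted arrays relate to the uncertainty bounds and the transition time can be illustrated with toy arrays (hypothetical values; the zero-crossing logic is an assumption about the definitions of ``transition_time`` and the bounds, not package code):

```python
import numpy as np

# Toy fit output on the extrapolation grid (hypothetical values).
z_array = np.linspace(0.0, 10.0, 100)
D_array = 1.0 - 0.2 * z_array           # fitted trend
DELTA_D2_array = np.full(100, 0.01)     # variances of the fitted values

sigma_multiples = 3
upper = D_array + sigma_multiples * np.sqrt(DELTA_D2_array)
lower = D_array - sigma_multiples * np.sqrt(DELTA_D2_array)

# First grid point at which the extrapolated trend reaches zero.
transition_time = z_array[np.argmax(D_array <= 0.0)]
print(round(float(transition_time), 3))  # 5.051
```

The zero crossings of ``upper`` and ``lower`` bracket the transition time, which is how the uncertainty of an extrapolated tipping point is reported.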
- static batched_compute_CP_pdfs(m, lock)[source]
Contains the working order to compute the marginal ordinal CP pdfs in a batch-wise and parallelized manner.
- compute_CP_pdfs(multiprocessing=True, num_processes='half', print_CPU_count=False, print_progress=True)[source]
Computes the marginal ordinal CP pdfs and stores them in the attribute ``self.CP_pdfs``.

- Parameters:

multiprocessing (bool) – If ``True``, the computations are parallelized on ``num_processes`` workers. Default is ``True``.

num_processes (str or int) – Default is ``'half'``; the computations are parallelized on half of the available CPU kernels. If ``'all'``, all kernels are used. You can also choose a specific number of CPU kernels for parallelization by passing an integer number here.

print_CPU_count (bool) – If ``True``, the number of available CPU kernels on the machine is shown. Default is ``False``.

print_progress (bool) – If ``True``, the ratio of already computed batches to total batches is shown. Default is ``True``.
- compute_expected_values_CP_positions()[source]
Computes the expected value of each ordinal CP position of the model. Implemented only if the CP configurations are stored in ``self.MC_cp_configurations``.
“batched_configs_helper” subpackage
The helper package enables the construction of CP configuration batches to avoid memory errors. It implements methods which are required for the memory-efficient version of the parallelized CP analysis.
- antiCPy.trend_extrapolation.batched_configs_helper.create_configs_helper.construct_start_combinations_helper(data, total_data, tuple_num, pick_out_combination)[source]
Internal helper method to construct the first CP configuration of a certain batch. It assumes drawing the ``pick_out_combination`` from a total list of configurations in a systematic combinatoric order.
- antiCPy.trend_extrapolation.batched_configs_helper.create_configs_helper.extrapolate_batch_combinations(data, batch_size, tuple_num, pick_out_combination)[source]
Internal helper method to extrapolate the next ``batch_size`` CP configurations, starting with the one constructed by the helper method ``construct_start_combinations_helper``, in systematic order.
- antiCPy.trend_extrapolation.batched_configs_helper.create_configs_helper.batched_configs(batch_num, batch_size, x, prediction_horizon, n_cp, exact_sum_control=False, config_output=False)[source]
Internal helper method to initialize the CP configuration of a given batch.
Bibliography
[vdL14]
von der Linden, W., Dose, V., & von Toussaint, U. (2014). Bayesian Probability Theory: Applications in the Physical Sciences. Cambridge: Cambridge University Press. doi:10.1017/CBO9781139565608
[K14]
A. Klöckner, F. van der Linden, and D. Zimmer, in Proceedings of the 10th International Modelica Conference, March 10-12, 2014, Lund, Sweden (Linköping University Electronic Press, 2014).