Raw Data Generation
bnode_core.data_generation.raw_data_generation
Raw data generation module for parallel FMU simulation.
Module Description
This module generates raw simulation data by running FMU (Functional Mock-up Unit) models in parallel with sampled inputs (initial states, parameters, controls). It uses Dask for distributed computing and writes results to HDF5 files with comprehensive logging.
Command-line Usage
With uv (recommended):
uv run raw_data_generation [overrides]
In activated virtual environment:
raw_data_generation [overrides]
Direct Python execution:
python -m bnode_core.data_generation.raw_data_generation [overrides]
Example Commands
# Generate 1000 samples with default config
uv run raw_data_generation pModel.RawData.n_samples=1000
# Use specific pModel config and allow overwriting
uv run raw_data_generation pModel=SHF overwrite=true
# Change control sampling strategy to RROCS
uv run raw_data_generation pModel.RawData.controls_sampling_strategy=RROCS
# Adjust parallel workers and timeout
uv run raw_data_generation multiprocessing_processes=8 pModel.RawData.Solver.timeout=120
# Adjust config path and name
uv run raw_data_generation --config-path=resources/config --config-name=data_generation_custom
What This Module Does
- Loads and validates configuration (FMU path, sampling strategies, solver settings)
- Sets reproducibility seed (np.random.seed(42))
- Creates HDF5 raw data file with pre-allocated datasets
- Samples input values (initial states, parameters, controls) using configured strategies
- Writes sampled inputs and metadata to HDF5 file
- Sets up Dask distributed cluster for parallel FMU simulation
- Submits simulation tasks in batches with timeout monitoring
- Incrementally writes simulation results (states, outputs, derivatives) to HDF5
- Logs completion status, failures, timeouts, and processing times per sample
- Saves configuration YAML file alongside raw data
See main() function for entry point and run_data_generation() for the complete pipeline.
Key Features
- Parallel execution using Dask LocalCluster with configurable workers
- Per-simulation timeout enforcement via ThreadPoolExecutor
- Automatic worker restart on repeated timeouts
- Incremental result writing (partial data available if interrupted)
- Comprehensive logging: completed, failed, timed-out simulations
- Multiple control sampling strategies (R, RO, ROCS, RROCS, RS, RF, file, Excel)
- Reproducible sampling (fixed seed since 2024-11-23)
- Dask dashboard for monitoring: http://localhost:8787
Sampling Strategies
- Parameters: 'R' (random uniform)
- Initial states: 'R' (random uniform)
- Controls: 'R' (random uniform), 'RO' (random with offset), 'ROCS' (cubic splines with clipping), 'RROCS' (cubic splines with random rescaling), 'RS' (random steps), 'RF' (frequency sweep), 'file' (from CSV), 'constantInput' (from Excel)
Configuration
Uses Hydra for configuration management. Config loaded from 'data_generation.yaml'.
Key config sections: pModel.RawData (all generation parameters including FMU path, bounds,
solver settings, sampling strategies), multiprocessing_processes (worker count),
memory_limit_per_worker (per-worker memory limit).
Output Files
- Raw data HDF5 file: Contains time, states, controls, outputs, parameters, logs
- Config YAML file: Snapshot of pModel.RawData configuration used for generation
Both file paths are determined by bnode_core.filepaths functions.
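For orientation, here is a minimal sketch of how the generated HDF5 file could be inspected with h5py; the dataset and attribute names used below ('states', 'creation_date') are assumptions based on the description above, not a verified schema.

```python
import h5py

# Hypothetical inspection of a generated raw-data file; dataset and
# attribute names are assumed from the description above.
with h5py.File("raw_data.h5", "r") as f:
    f.visit(print)                    # list all groups and datasets
    print(dict(f.attrs))              # metadata, e.g. creation_date and config YAML
    if "states" in f:
        print(f["states"].shape)      # e.g. (n_samples, n_states, sequence_length)
```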
random_sampling_parameters(cfg: data_gen_config) -> np.ndarray
Sample parameter values uniformly within configured bounds.
Generates a 2D array of parameter values by sampling uniformly from the bounds specified in cfg.pModel.RawData.parameters for each parameter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing parameter bounds and n_samples. cfg.pModel.RawData.parameters is a dict where each key maps to [lower_bound, upper_bound]. cfg.pModel.RawData.n_samples specifies the number of parameter sets to generate. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Parameter values with shape (n_samples, n_parameters). Each row is one sampled parameter set. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
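For illustration, a minimal NumPy sketch of the same idea, i.e. uniform sampling within per-parameter bounds; the bounds dictionary below is a made-up example, not a real configuration.

```python
import numpy as np

# Assumed example bounds: each key maps to [lower_bound, upper_bound].
parameters = {"mass": [0.5, 2.0], "damping": [0.01, 0.1]}
n_samples = 1000

lower = np.array([b[0] for b in parameters.values()])
upper = np.array([b[1] for b in parameters.values()])
# Uniform samples, shape (n_samples, n_parameters)
param_values = np.random.uniform(lower, upper, size=(n_samples, len(parameters)))
```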
random_sampling_controls(cfg: data_gen_config) -> np.ndarray
Sample control input values uniformly within configured bounds.
Generates a 3D array of control trajectories by sampling uniformly from the bounds specified in cfg.pModel.RawData.controls for each control variable at each timestep. Each control trajectory is independently sampled (no temporal correlation).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing control bounds, n_samples, and sequence_length. cfg.pModel.RawData.controls is a dict where each key maps to [lower_bound, upper_bound]. cfg.pModel.RawData.n_samples specifies the number of control trajectories to generate. cfg.pModel.RawData.Solver.sequence_length specifies the number of timesteps. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). Each element is independently sampled from uniform distributions. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
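A minimal NumPy sketch of independent uniform sampling over a 3D control array; the control names and bounds are made-up examples.

```python
import numpy as np

# Assumed example bounds per control; target shape is
# (n_samples, n_controls, sequence_length), matching the docstring.
controls = {"valve_opening": [0.0, 1.0], "pump_speed": [0.2, 0.8]}
n_samples, seq_len = 100, 512

lower = np.array([b[0] for b in controls.values()])[:, None]   # (n_controls, 1)
upper = np.array([b[1] for b in controls.values()])[:, None]
ctrl_values = np.random.uniform(lower, upper,
                                size=(n_samples, len(controls), seq_len))
```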
random_sampling_controls_w_offset(cfg: data_gen_config, seq_len: Optional[int] = None, n_samples: Optional[int] = None) -> np.ndarray
Sample control trajectories with random offset and bounded amplitude.
For each control trajectory, first samples a random offset within the control bounds, then samples an amplitude that ensures the trajectory stays within bounds. Each timestep is sampled uniformly within [offset - amplitude_lower, offset + amplitude_upper].
This produces control trajectories that vary around a central offset value rather than exploring the full control space independently at each timestep.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing control bounds. cfg.pModel.RawData.controls is a dict where each key maps to [lower_bound, upper_bound]. cfg.pModel.RawData.n_samples and cfg.pModel.RawData.Solver.sequence_length are used as defaults if n_samples or seq_len are not provided. | *required* |
| `seq_len` | `Optional[int]` | Optional sequence length override. If None, uses cfg.pModel.RawData.Solver.sequence_length. | `None` |
| `n_samples` | `Optional[int]` | Optional sample count override. If None, uses cfg.pModel.RawData.n_samples. | `None` |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, seq_len). Each trajectory varies around a sampled offset with bounded amplitude. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
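A sketch of the 'RO' idea for a single control, assuming bounds [lo, hi]; the exact amplitude handling in the library may differ.

```python
import numpy as np

# Sketch of offset-based sampling for one control with bounds [lo, hi].
lo, hi = 0.0, 1.0
n_samples, seq_len = 100, 512

ctrl = np.empty((n_samples, seq_len))
for i in range(n_samples):
    offset = np.random.uniform(lo, hi)
    # Amplitudes bounded so the trajectory cannot leave [lo, hi].
    amp_lower = np.random.uniform(0.0, offset - lo)
    amp_upper = np.random.uniform(0.0, hi - offset)
    ctrl[i] = np.random.uniform(offset - amp_lower, offset + amp_upper, size=seq_len)
```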
random_sampling_controls_w_offset_cubic_splines_old_clip_manual(cfg: data_gen_config) -> np.ndarray
Sample control trajectories using cubic spline interpolation with manual clipping (ROCS).
Also known as ROCS (Random Offset Cubic Splines). Generates smooth control trajectories by:
- Sampling control values at random intervals
- Interpolating with cubic splines
- Normalizing to fit within bounds via manual clipping
ROCS fills the control space more than RROCS because values exceeding bounds are clipped to the bounds rather than rescaled.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls_frequency_min_in_timesteps: minimum interval between samples. cfg.pModel.RawData.controls_frequency_max_in_timesteps: maximum interval between samples. cfg.pModel.RawData.controls: dict of control bounds [lower, upper]. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). Smooth trajectories that fill the control space via clipping. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
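A sketch of an ROCS-style trajectory for one control using SciPy's CubicSpline; the knot spacing and clipping details are assumptions, not the library's implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sketch of an ROCS-like trajectory for one control with bounds [lo, hi].
lo, hi = 0.0, 1.0
seq_len, freq_min, freq_max = 512, 20, 80   # knot spacing in timesteps (assumed)

t_knots = [0]
while t_knots[-1] < seq_len:
    t_knots.append(t_knots[-1] + np.random.randint(freq_min, freq_max))
values = np.random.uniform(lo, hi, size=len(t_knots))

spline = CubicSpline(np.array(t_knots), values)
trajectory = np.clip(spline(np.arange(seq_len)), lo, hi)   # manual clipping to bounds
```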
random_sampling_controls_w_offset_cubic_splines_clip_random(cfg: data_gen_config) -> np.ndarray
Sample control trajectories using cubic spline interpolation with random rescaling (RROCS).
Also known as RROCS (Randomly Rescaled Offset Cubic Splines). Generates smooth control trajectories by:
- For each control and sample, sampling values at random intervals (e.g. different frequencies), with sampled amplitudes and offsets
- Interpolating with cubic splines
- Normalizing to [0, 1] and rescaling with randomly sampled base and delta
- Optionally clipping to tighter bounds if specified
RROCS fills the control space less uniformly than ROCS because values are rescaled to fit within the bounds rather than clipped. As a result, fewer samples typically lie at the sampling bounds.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls_frequency_min_in_timesteps: minimum interval between samples. cfg.pModel.RawData.controls_frequency_max_in_timesteps: maximum interval between samples. cfg.pModel.RawData.controls: dict where each key maps to [lower, upper] or [lower, upper, clip_lower, clip_upper] for optional tighter clipping bounds. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). Smooth trajectories with diverse amplitude and offset characteristics. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
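A sketch of the RROCS idea for one control: spline through random knots, normalize to [0, 1], then rescale with a random base and span so the trajectory fits the bounds without clipping. The details are assumptions, not the library's implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sketch of an RROCS-like trajectory for one control with bounds [lo, hi].
lo, hi = 0.0, 1.0
seq_len = 512

steps = np.random.randint(20, 80, size=seq_len // 20 + 1)   # assumed knot spacing
t_knots = np.concatenate(([0], np.cumsum(steps)))
values = np.random.uniform(0.0, 1.0, size=len(t_knots))
raw = CubicSpline(t_knots, values)(np.arange(seq_len))

normalized = (raw - raw.min()) / (raw.max() - raw.min())   # map to [0, 1]
delta = np.random.uniform(0.0, hi - lo)                     # trajectory span
base = np.random.uniform(lo, hi - delta)                    # offset within bounds
trajectory = base + delta * normalized                      # stays within [lo, hi]
```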
random_steps_sampling_controls(cfg: data_gen_config) -> np.ndarray
Sample step-change control trajectories for system response testing.
Generates control trajectories with a single step change at the midpoint. Each control starts at a randomly sampled value and steps to another randomly sampled value halfway through the sequence. Useful for testing system step response characteristics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls: dict of control bounds [lower, upper]. cfg.pModel.RawData.n_samples: number of step trajectories to generate. cfg.pModel.RawData.Solver.sequence_length: total trajectory length. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). Each trajectory has a step change at sequence_length // 2. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
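A sketch of the 'RS' idea: each trajectory holds one random value for the first half and steps to another random value at the midpoint. Bounds and sizes are made-up examples.

```python
import numpy as np

# Step-change trajectories for one control with bounds [lo, hi].
lo, hi = 0.0, 1.0
n_samples, seq_len = 100, 512

start = np.random.uniform(lo, hi, size=(n_samples, 1))
end = np.random.uniform(lo, hi, size=(n_samples, 1))
ctrl = np.concatenate(
    [np.repeat(start, seq_len // 2, axis=1),
     np.repeat(end, seq_len - seq_len // 2, axis=1)],
    axis=1,
)   # shape (n_samples, seq_len), step at seq_len // 2
```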
random_frequency_response_sampling_controls(cfg: data_gen_config) -> np.ndarray
Sample frequency-sweep control trajectories for system identification.
Generates control trajectories with a chirp (frequency sweep) starting at the midpoint. The first half is constant, and the second half contains a sine wave with linearly increasing frequency from min to max. Useful for system identification and frequency response analysis.
The frequency sweep goes from _min_frequency (low) to _max_frequency (high), both derived from the configured control frequency bounds (multiplied by 4, since those bounds represent half-periods).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls: dict of control bounds [lower, upper]. cfg.pModel.RawData.controls_frequency_min_in_timesteps: base for max sweep frequency. cfg.pModel.RawData.controls_frequency_max_in_timesteps: base for min sweep frequency. cfg.pModel.RawData.n_samples: number of trajectories to generate. cfg.pModel.RawData.Solver.sequence_length: total trajectory length. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). First half constant, second half contains frequency sweep. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
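A sketch of the 'RF' idea for one control: constant first half, linear chirp in the second half. The frequency values below are illustrative, not derived from a real config.

```python
import numpy as np

# Frequency-sweep trajectory for one control with bounds [lo, hi].
lo, hi = 0.0, 1.0
seq_len = 512
f_min, f_max = 0.002, 0.05          # cycles per timestep (assumed)

t = np.arange(seq_len // 2)
# Linear chirp: instantaneous frequency rises from f_min to f_max.
phase = 2 * np.pi * (f_min * t + (f_max - f_min) * t**2 / (2 * len(t)))
offset, amplitude = (lo + hi) / 2, (hi - lo) / 2
sweep = offset + amplitude * np.sin(phase)

trajectory = np.concatenate([np.full(seq_len - len(t), offset), sweep])
```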
load_controls_from_file(cfg: data_gen_config) -> np.ndarray
Load control trajectories from a CSV file and resample to simulation time vector.
Reads control values from a CSV file where columns match control variable names from the config. The CSV must include a 'time' column. Control values are resampled via linear interpolation to match the simulation timestep, then replicated for all samples.
TODO: could be extended to load multiple trajectories for different samples.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls_file_path: path to CSV file with time and control columns. cfg.pModel.RawData.controls: dict of control names (used as column names). cfg.pModel.RawData.Solver: simulation time parameters (start, end, timestep). cfg.pModel.RawData.n_samples: number of times to replicate the loaded trajectory. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_samples, n_controls, sequence_length). Same trajectory replicated across all samples. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
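A sketch of the same pattern with pandas and NumPy; the file name, column names, and time vector are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Load controls from CSV and resample to the simulation time vector.
df = pd.read_csv("controls.csv")                 # must contain a 'time' column
sim_time = np.arange(0.0, 100.0, 0.5)            # assumed solver time vector
control_names = ["valve_opening", "pump_speed"]

resampled = np.stack(
    [np.interp(sim_time, df["time"], df[name]) for name in control_names]
)                                                # (n_controls, sequence_length)
n_samples = 10
ctrl_values = np.tile(resampled, (n_samples, 1, 1))   # replicate for all samples
```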
constant_input_simulation_from_excel(cfg: data_gen_config) -> np.ndarray
Load constant control values from an Excel file for steady-state simulations.
Reads an Excel file with a sheet named 'Tabelle1' where each row defines one simulation with constant control values. Control columns must be named to match config control names. Each row's values are held constant for the entire sequence length.
Useful for steady-state simulations or parameter sweeps with constant inputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration. cfg.pModel.RawData.controls_file_path: path to Excel file. cfg.pModel.RawData.controls: dict of control names (must match column names in Excel). cfg.pModel.RawData.Solver.sequence_length: length to replicate constant values. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Control values with shape (n_rows, n_controls, sequence_length). Each row from Excel becomes one sample with constant control values. |
Notes
Excel file structure:
- Sheet name: 'Tabelle1'
- First row: column headers matching control variable names
- Each subsequent row: one set of constant control values for one simulation
Source code in src/bnode_core/data_generation/raw_data_generation.py
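A sketch of the same pattern with pandas; the file name and control column names are assumptions, while the sheet name 'Tabelle1' follows the description above.

```python
import numpy as np
import pandas as pd

# Each Excel row becomes one sample with constant control values.
df = pd.read_excel("constant_inputs.xlsx", sheet_name="Tabelle1")
control_names = ["valve_opening", "pump_speed"]
seq_len = 512

rows = df[control_names].to_numpy()                          # (n_rows, n_controls)
ctrl_values = np.repeat(rows[:, :, None], seq_len, axis=2)   # constant over time
```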
random_sampling_initial_states(cfg: data_gen_config) -> np.ndarray
Sample initial state values uniformly within configured bounds.
Generates a 2D array of initial state values by sampling uniformly from the bounds specified in cfg.pModel.RawData.states for each state variable.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing state bounds and n_samples. cfg.pModel.RawData.states is a dict where each key maps to [lower_bound, upper_bound]. cfg.pModel.RawData.n_samples specifies the number of initial state sets to generate. | *required* |

Returns:

| Type | Description |
|---|---|
| `np.ndarray` | Initial state values with shape (n_samples, n_states). Each row is one sampled initial state vector. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
progress_string(progress: float, length: int = 10) -> str
Generate a visual progress bar string for logging.
Returns a visual progress string of the form '|||||.....' for a given progress value in [0, 1].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `progress` | `float` | Progress value between 0 and 1. | *required* |
| `length` | `int` | Total length of the progress string. | `10` |

Returns:

| Type | Description |
|---|---|
| `str` | Progress bar string using vertical bars for the completed portion and dots for the remaining portion. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
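A sketch of the described behaviour (not necessarily the exact implementation):

```python
# Hypothetical re-implementation: '|' for completed, '.' for remaining.
def progress_bar(progress: float, length: int = 10) -> str:
    filled = int(round(progress * length))
    return "|" * filled + "." * (length - filled)

progress_bar(0.5)   # '|||||.....'
```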
data_generation(cfg: data_gen_config, initial_state_values: np.ndarray = None, param_values: np.ndarray = None, ctrl_values: np.ndarray = None)
Execute parallel FMU simulations and write results to raw data HDF5 file.
Core data generation function that:
- Sets up a Dask distributed cluster for parallel FMU simulation
- Submits simulation tasks for each sample in batches
- Monitors task completion and handles timeouts/failures
- Incrementally writes results to the raw data HDF5 file
- Logs completion status, failures, and timing information
The function uses ThreadPoolExecutor to enforce per-simulation timeouts and Dask's LocalCluster for parallel execution across multiple workers. Results are written incrementally so partial data is available even if generation is interrupted.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing: FMU path and simulation parameters; solver settings (timestep, tolerance, timeout); multiprocessing and memory settings; output file paths. | *required* |
| `initial_state_values` | `np.ndarray` | Optional array of shape (n_samples, n_states) with initial states. | `None` |
| `param_values` | `np.ndarray` | Optional array of shape (n_samples, n_parameters) with parameter values. | `None` |
| `ctrl_values` | `np.ndarray` | Optional array of shape (n_samples, n_controls, sequence_length) with controls. | `None` |
Notes
- The raw data HDF5 file must already exist with pre-allocated datasets.
- Dask worker memory limits and allowed failures are configured from cfg settings.
- Progress is logged via the Dask diagnostic dashboard at http://localhost:8787.
- Per-sample logs (completed, sim_failed, timedout, processing_time) are written incrementally to the HDF5 file.
- If a worker's tasks timeout repeatedly, the worker is restarted automatically.
- For large numbers of samples, tasks are submitted in "submission rounds" (batches of 10,000 simulations) to avoid overwhelming the scheduler.
Raises:

| Type | Description |
|---|---|
| `BaseException` | Any exception during generation is caught to ensure partial results are saved before re-raising. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
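A minimal sketch of the parallelisation pattern described above, not the module's actual code: Dask's LocalCluster runs simulations in parallel, while a thread-pool future enforces a per-simulation timeout when collecting each result. Worker count, memory limit, and timeout values are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from dask.distributed import Client, LocalCluster

def simulate(sample_index):
    ...          # placeholder: run one FMU simulation and return its results

cluster = LocalCluster(n_workers=8, memory_limit="4GB")
client = Client(cluster)                  # dashboard at http://localhost:8787

futures = [client.submit(simulate, i) for i in range(100)]
with ThreadPoolExecutor(max_workers=1) as pool:
    for fut in futures:
        try:
            # Collect the Dask result in a thread so a timeout can be enforced.
            result = pool.submit(fut.result).result(timeout=120)
        except TimeoutError:
            ...  # log the timeout; optionally restart the worker
```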
sample_all_values(cfg: data_gen_config) -> Tuple[Optional[np.ndarray], Optional[np.ndarray], Optional[np.ndarray]]
Sample all input values (initial states, parameters, controls) according to config.
Orchestrates sampling for all simulation inputs based on the configured sampling strategies. Returns None for any input category not included in the config. For parameters, if sampling is disabled, returns default parameter values for all samples.
Supported sampling strategies
- Initial states: 'R' (random uniform)
- Controls: 'R', 'RO' (random with offset), 'ROCS', 'RROCS', 'RS' (random steps), 'RF' (frequency response), 'file' (from CSV), 'constantInput' (from Excel)
- Parameters: 'R' (random uniform)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration containing the sampling strategies, bounds, and sample counts for initial states, parameters, and controls (cfg.pModel.RawData). | *required* |

Returns:

| Type | Description |
|---|---|
| `Tuple[Optional[np.ndarray], Optional[np.ndarray], Optional[np.ndarray]]` | Tuple of (initial_state_values, param_values, ctrl_values); each element is None if the corresponding input category is not included in the config. |
Source code in src/bnode_core/data_generation/raw_data_generation.py
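The strategy selection could look roughly like the sketch below; the dispatch dictionary is illustrative, the function names are those documented on this page, and sample_all_values performs this selection internally.

```python
from bnode_core.data_generation.raw_data_generation import (
    random_sampling_controls,
    random_sampling_controls_w_offset,
    random_sampling_controls_w_offset_cubic_splines_old_clip_manual,
    random_sampling_controls_w_offset_cubic_splines_clip_random,
    random_steps_sampling_controls,
    random_frequency_response_sampling_controls,
    load_controls_from_file,
    constant_input_simulation_from_excel,
)

# Hypothetical mapping from strategy string to sampler function.
CONTROL_SAMPLERS = {
    "R": random_sampling_controls,
    "RO": random_sampling_controls_w_offset,
    "ROCS": random_sampling_controls_w_offset_cubic_splines_old_clip_manual,
    "RROCS": random_sampling_controls_w_offset_cubic_splines_clip_random,
    "RS": random_steps_sampling_controls,
    "RF": random_frequency_response_sampling_controls,
    "file": load_controls_from_file,
    "constantInput": constant_input_simulation_from_excel,
}

def sample_controls(cfg):
    strategy = cfg.pModel.RawData.controls_sampling_strategy
    return CONTROL_SAMPLERS[strategy](cfg)
```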
run_data_generation(cfg: data_gen_config) -> None
Main orchestration function for raw data generation pipeline.
Complete raw data generation workflow:
- Convert and validate configuration
- Set reproducibility seed (np.random.seed(42))
- Create raw data HDF5 file with pre-allocated datasets
- Sample all input values (initial states, parameters, controls)
- Write sampled inputs and metadata to HDF5 file
- Execute parallel FMU simulations via data_generation()
- Save configuration as YAML file
The function prompts for confirmation before overwriting existing raw data files (unless cfg.overwrite is True). It creates the complete HDF5 structure including:
- Time vector and sampled inputs (initial_states, parameters, controls)
- Pre-allocated arrays for simulation outputs (states, states_der, outputs)
- Metadata attributes (creation_date, config YAML)
- Log datasets for tracking simulation status
This is the Hydra-decorated entry point called by main().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `data_gen_config` | Data generation configuration (automatically populated by Hydra from YAML + CLI args). Key settings include cfg.pModel.RawData (FMU path, bounds, solver settings, sampling strategies, n_samples), overwrite, multiprocessing_processes, and memory_limit_per_worker. | *required* |
Notes
- Sets np.random.seed(42) for reproducibility (added 2024-11-23).
- Raw data HDF5 file path determined by filepath_raw_data(cfg).
- Config YAML path determined by filepath_raw_data_config(cfg).
- The HDF5 file config attribute stores OmegaConf.to_yaml(cfg.pModel.RawData).
- Creation date is recorded both in HDF5 attrs and in the config YAML.
Source code in src/bnode_core/data_generation/raw_data_generation.py
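A sketch of the kind of pre-allocated HDF5 structure described above, written with h5py; dataset names, shapes, and the log layout are assumptions based on this page, not the exact file schema.

```python
import datetime
import h5py
import numpy as np

# Hypothetical pre-allocation of the raw-data file before simulation results
# are written incrementally.
n_samples, n_states, seq_len = 1000, 4, 512

with h5py.File("raw_data.h5", "w") as f:
    f.create_dataset("time", data=np.arange(seq_len) * 0.5)
    f.create_dataset("states", shape=(n_samples, n_states, seq_len), dtype="f8")
    f.create_dataset("log/completed", shape=(n_samples,), dtype=bool)
    f.attrs["creation_date"] = datetime.datetime.now().isoformat()
    f.attrs["config"] = "..."   # e.g. OmegaConf.to_yaml(cfg.pModel.RawData)
```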
main()
CLI entry point for raw data generation.
Sets up Hydra configuration management and launches run_data_generation().
Hydra automatically:
- Loads the data_generation.yaml config from the auto-detected config directory
- Parses command-line overrides
- Creates a working directory for outputs
- Injects the composed config into run_data_generation()
Usage
python raw_data_generation.py [overrides]
Examples:
python raw_data_generation.py pModel.RawData.n_samples=1000
python raw_data_generation.py pModel=SHF overwrite=true
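A minimal sketch of a Hydra entry point of this shape; config_path follows the example command above and config_name follows the configuration section, but the function body is illustrative.

```python
import hydra
from omegaconf import DictConfig

# Hypothetical entry point: Hydra composes the config and injects it.
@hydra.main(config_path="resources/config", config_name="data_generation",
            version_base=None)
def run(cfg: DictConfig) -> None:
    print(cfg.pModel.RawData.n_samples)   # access composed settings

if __name__ == "__main__":
    run()
```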