Data.dataset (Dataset) Module
ppsci.data.dataset
CGCNNDataset
Bases: Dataset
The CIFData dataset is a wrapper for a dataset in which the crystal structures are stored as CIF files. The dataset should have the following directory structure:
root_dir
├── id_prop.csv
├── atom_init.json
├── id0.cif
├── id1.cif
├── ...
- id_prop.csv: a CSV file with two columns. The first column records a unique ID for each crystal, and the second column records the value of the target property.
- atom_init.json: a JSON file that stores the initialization vector for each element.
- ID.cif: a CIF file that records the crystal structure, where ID is the unique ID for the crystal.
Args:
- root_dir (str): The path to the root directory of the dataset.
- max_num_nbr (int): The maximum number of neighbors used while constructing the crystal graph.
- radius (float): The cutoff radius for searching neighbors.
- dmin (float): The minimum distance for constructing GaussianDistance.
- step (float): The step size for constructing GaussianDistance.
- random_seed (int): Random seed for shuffling the dataset.
Returns:
- atom_fea (paddle.Tensor): Shape (n_i, atom_fea_len).
- nbr_fea (paddle.Tensor): Shape (n_i, M, nbr_fea_len).
- nbr_fea_idx (paddle.Tensor): Shape (n_i, M).
- target (paddle.Tensor): Shape (1, ).
- cif_id (str or int).
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.CGCNNDataset(
...     root_dir="/path/to/CGCNNDataset",
...     input_keys="i",
...     label_keys="l",
...     id_keys="c",
... )
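The dmin, step, and radius arguments feed a Gaussian distance expansion of each neighbor distance. The following is a minimal sketch of that idea, assuming the formulation used in the original CGCNN reference code: centers spaced `step` apart from `dmin` up to the cutoff, with response exp(-(d - c)^2 / var^2). The helper names and the numeric values are illustrative and are not taken from ppsci.

```python
import math


def gaussian_centers(dmin, dmax, step):
    """Centers of the Gaussian basis: dmin, dmin + step, ..., dmax (inclusive)."""
    count = int(round((dmax - dmin) / step)) + 1
    return [dmin + i * step for i in range(count)]


def expand_distance(distance, centers, var):
    """Expand one scalar distance into a vector of Gaussian responses."""
    return [math.exp(-((distance - c) ** 2) / var**2) for c in centers]


# Illustrative values only: cutoff 8.0, step 0.2, width equal to the step.
centers = gaussian_centers(dmin=0.0, dmax=8.0, step=0.2)
features = expand_distance(1.0, centers, var=0.2)
```

Each neighbor distance becomes a vector whose length equals the number of centers, which would correspond to nbr_fea_len in the shapes listed above if the dataset follows this scheme.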
Source code in ppsci/data/dataset/cgcnn_dataset.py
ChipHeatDataset
Bases: Dataset
ChipHeatDataset for data loading of multi-branch DeepONet model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Dict[str, ndarray] | Input dict. | required |
| label | Optional[Dict[str, ndarray]] | Label dict. Defaults to None. | required |
| index | tuple[str, ...] | Keys of the input dict. | required |
| data_type | str | One of the keys of the input dict. | required |
| weight | Optional[Dict[str, ndarray]] | Weight dict. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
Examples:
>>> import ppsci
>>> import numpy as np
>>> input = {"x": np.random.randn(100, 1)}
>>> label = {"u": np.random.randn(100, 1)}
>>> index = ('x', 'u', 'bc', 'bc_data')
>>> data_type = 'u'
>>> weight = {"u": np.random.randn(100, 1)}
>>> dataset = ppsci.data.dataset.ChipHeatDataset(input, label, index, data_type, weight)
Source code in ppsci/data/dataset/array_dataset.py
ContinuousNamedArrayDataset
Bases: IterableDataset
ContinuousNamedArrayDataset for iterable sampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Callable | Function that generates the input dict. | required |
| label | Callable | Function that generates the label dict. | required |
| weight | Optional[Callable] | Function that generates the weight dict. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
Examples:
>>> import ppsci
>>> import numpy as np
>>> input = lambda : {"x": np.random.randn(100, 1)}
>>> label = lambda inp: {"u": np.random.randn(100, 1)}
>>> weight = lambda inp, label: {"u": 1 - (label["u"] ** 2)}
>>> dataset = ppsci.data.dataset.ContinuousNamedArrayDataset(input, label, weight)
>>> input_batch, label_batch, weight_batch = next(iter(dataset))
>>> print(input_batch["x"].shape)
[100, 1]
>>> print(label_batch["u"].shape)
[100, 1]
>>> print(weight_batch["u"].shape)
[100, 1]
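The calling convention in the example above (input takes no arguments, label takes the input dict, weight takes both) can be illustrated without ppsci. This is a minimal sketch using plain Python lists; the `sample_forever` generator is a hypothetical stand-in, and whether the real class iterates endlessly is an assumption.

```python
import random


def sample_forever(input_fn, label_fn, weight_fn=None):
    """Yield a freshly generated (input, label, weight) triple on every step."""
    while True:
        inp = input_fn()
        lab = label_fn(inp)
        wgt = weight_fn(inp, lab) if weight_fn is not None else {}
        yield inp, lab, wgt


input_fn = lambda: {"x": [random.gauss(0.0, 1.0) for _ in range(100)]}
label_fn = lambda inp: {"u": [2.0 * x for x in inp["x"]]}
weight_fn = lambda inp, lab: {"u": [1.0 for _ in lab["u"]]}

stream = sample_forever(input_fn, label_fn, weight_fn)
first_input, first_label, first_weight = next(stream)
second_input, _, _ = next(stream)
```

Because the generators are called again on each step, every batch is a new random draw rather than a slice of a fixed array.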
Source code in ppsci/data/dataset/array_dataset.py
num_samples
property
Number of samples within current dataset.
CSVDataset
Bases: Dataset
Dataset class for .csv file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | CSV file path. | required |
| input_keys | Tuple[str, ...] | List of input keys. | required |
| label_keys | Tuple[str, ...] | List of label keys. | required |
| alias_dict | Optional[Dict[str, str]] | Dict of alias(es) for input and label keys, i.e. {inner_key: outer_key}. Defaults to None. | None |
| weight_dict | Optional[Dict[str, Union[Callable, float]]] | Define the weight of each constraint variable. Defaults to None. | None |
| timestamps | Optional[Tuple[float, ...]] | The number of repetitions of the data in the time dimension. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.CSVDataset(
... "/path/to/file.csv",
... ("x",),
... ("u",),
... )
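The alias_dict mapping {inner_key: outer_key} can be pictured as a lookup from the key used in code to the column header in the file. The sketch below uses only the standard library; `load_columns` is a hypothetical helper and the column names are made up for illustration, so it should be read as an assumption about the intent rather than the library's loader.

```python
import csv
import io


def load_columns(csv_text, keys, alias_dict=None):
    """Read the columns named by `keys`; `alias_dict` maps a key to its CSV header."""
    alias_dict = alias_dict or {}
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        key: [float(row[alias_dict.get(key, key)]) for row in rows]
        for key in keys
    }


# Hypothetical file content whose headers differ from the keys used in code.
csv_text = "Points:0,U:0\n0.0,1.5\n0.5,2.5\n"
inputs = load_columns(csv_text, ("x",), alias_dict={"x": "Points:0"})
labels = load_columns(csv_text, ("u",), alias_dict={"u": "U:0"})
```

Keys without an alias fall back to a header of the same name.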
Source code in ppsci/data/dataset/csv_dataset.py
CylinderDataset
Bases: Dataset
Dataset for training Cylinder model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path. | required |
| input_keys | Tuple[str, ...] | Input keys, such as ("states","visc"). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("pred_states", "recover_states"). | required |
| block_size | int | Data block size. | required |
| stride | int | Data stride. | required |
| ndata | Optional[int] | Number of data series to use. Defaults to None. | None |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| embedding_model | Optional[Arch] | Embedding model. Defaults to None. | None |
| embedding_batch_size | int | The batch size of embedding model. Defaults to 64. | 64 |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.CylinderDataset(
...     file_path="/path/to/CylinderDataset",
...     input_keys=("x",),
...     label_keys=("v",),
...     block_size=32,
...     stride=16,
... )
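The block_size and stride arguments suggest a sliding-window split of each time series. The following minimal sketch shows one common reading, in which windows of length block_size start every stride steps; `split_into_blocks` is a hypothetical helper, and the exact boundary handling in the library is not confirmed here.

```python
def split_into_blocks(series, block_size, stride):
    """Return windows of length `block_size`, starting every `stride` steps."""
    return [
        series[start : start + block_size]
        for start in range(0, len(series) - block_size + 1, stride)
    ]


series = list(range(80))  # stand-in for 80 time steps of one simulation
blocks = split_into_blocks(series, block_size=32, stride=16)
```

With a stride smaller than the block size, consecutive windows overlap, so one long series yields several training blocks.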
Source code in ppsci/data/dataset/trphysx_dataset.py
DarcyFlowDataset
Bases: Dataset
Loads a small Darcy-Flow dataset.
The training set contains 1000 samples at resolution 16x16. The test set contains 100 samples at resolution 16x16 and 50 samples at resolution 32x32.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("output",). | required |
| data_dir | str | The directory to load data from. | required |
| weight_dict | Optional[Dict[str, float]] | Define the weight of each constraint variable. Defaults to None. | None |
| test_resolutions | Tuple[int, ...] | The resolutions of the test dataset. Default is [16, 32]. | [32] |
| grid_boundaries | Tuple[int, ...] | The boundaries of the grid. Default is [[0,1],[0,1]]. | [[0, 1], [0, 1]] |
| positional_encoding | bool | Whether to use positional encoding. Default is True. | True |
| encode_input | bool | Whether to encode the input. Default is False. | False |
| encode_output | bool | Whether to encode the output. Default is True. | True |
| encoding | str | The type of encoding. Default is 'channel-wise'. | 'channel-wise' |
| channel_dim | int | Where to insert (unsqueeze) the channel dimension. Default is 1, which gives the layout (batch, channel, height, width). | 1 |
| data_split | str | Whether to use the training or test dataset. Default is 'train'. | 'train' |
Source code in ppsci/data/dataset/darcyflow_dataset.py
DGMRDataset
Bases: Dataset
Dataset class for DGMR (Deep Generative Model for Radar) model. This open-sourced UK dataset has been mirrored to HuggingFace Datasets https://huggingface.co/datasets/openclimatefix/nimrod-uk-1km. If the reader cannot load the dataset from Hugging Face, please manually download it and modify the dataset_path to the local path for loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("output",). | required |
| split | str | The split of the dataset, "validation" or "train". Defaults to "validation". | 'validation' |
| num_input_frames | int | Number of input frames. Defaults to 4. | 4 |
| num_target_frames | int | Number of target frames. Defaults to 18. | 18 |
| dataset_path | str | Path to the dataset. Defaults to "openclimatefix/nimrod-uk-1km". | 'openclimatefix/nimrod-uk-1km' |
Examples:
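The num_input_frames and num_target_frames parameters describe how a radar sequence is divided into conditioning frames and frames to predict. The sketch below illustrates that split on a plain list, assuming the input frames come first and the target frames follow immediately; `split_frames` is a hypothetical helper and does not reproduce the library's loading code.

```python
def split_frames(frames, num_input_frames=4, num_target_frames=18):
    """Split a frame sequence into leading input frames and following target frames."""
    needed = num_input_frames + num_target_frames
    if len(frames) < needed:
        raise ValueError(f"need at least {needed} frames, got {len(frames)}")
    return frames[:num_input_frames], frames[num_input_frames:needed]


frames = list(range(24))  # stand-in for 24 consecutive radar frames
input_frames, target_frames = split_frames(frames)
```

With the defaults from the table, a sequence must hold at least 22 frames for this split to succeed.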
Source code in ppsci/data/dataset/dgmr_dataset.py
DrivAerNetDataset
Bases: Dataset
Paddle Dataset class for the DrivAerNet dataset, handling loading, transforming, and augmenting 3D car models.
This dataset is specifically designed for aerodynamic tasks, including training machine learning models to predict aerodynamic coefficients such as drag coefficient (Cd) from 3D car models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Tuple specifying the keys for input features. These keys correspond to the attributes of the dataset used as input to the model. For example, "vertices" represents the 3D point cloud vertices of car models. | required |
| label_keys | Tuple[str, ...] | Tuple specifying the keys for ground-truth labels. These keys correspond to the target values, such as aerodynamic coefficients like Cd. Example: ("cd_value",) | required |
| weight_keys | Tuple[str, ...] | Tuple specifying the keys for optional sample weights. These keys represent weighting factors that may be used to adjust loss computation during model training. Useful for handling sample imbalance. Example: ("weight_keys",) | required |
| subset_dir | str | Path to the directory containing subset information. This directory typically contains files that divide the dataset into training, validation, and test subsets using a list of model IDs. | required |
| ids_file | str | Path to the text file containing model IDs for the current subset. Each line in the file corresponds to a unique model ID that defines which models belong to the subset (e.g., training set or test set). | required |
| root_dir | str | Directory containing the STL files of 3D car models. Each STL file is expected to represent a single car model and is named according to the corresponding model ID. This is the primary data source. | required |
| csv_file | str | Path to the CSV file containing metadata for car models. This file typically includes aerodynamic properties (e.g., drag coefficient) and other descriptive attributes mapped to each model ID. | required |
| num_points | int | Fixed number of points to sample from each 3D model. If a 3D model has more points than | required |
| transform | Optional[Callable] | An optional callable for applying data transformations. This can include augmentations such as scaling, rotation, jittering, or other preprocessing steps applied to the 3D point clouds before they are passed to the model. | None |
| pointcloud_exist | bool | Whether the point clouds are pre-processed and saved as | True |
| train_fractions | float | Fraction of the training data to use. Useful for experiments where only a portion of the data is needed. | 1.0 |
| mode | str | Mode of operation, either "train", "eval", or "test". Determines how the dataset behaves. | 'eval' |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.DrivAerNetDataset(
... input_keys=("vertices",),
... label_keys=("cd_value",),
... weight_keys=("weight_keys",),
... subset_dir="/path/to/subset_dir",
... ids_file="train_ids.txt",
... root_dir="/path/to/DrivAerNetDataset",
... csv_file="/path/to/aero_metadata.csv",
... num_points=1024,
... transform=None,
... )
Source code in ppsci/data/dataset/drivaernet_dataset.py
__getitem__(idx, apply_augmentations=True)
Retrieves a sample and its corresponding label from the dataset, with an option to apply augmentations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index of the sample to retrieve. | required |
| apply_augmentations | bool | Whether to apply data augmentations. Defaults to True. | True |
Returns:
Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Dict[str, np.ndarray]]: A tuple containing three dictionaries:
- The first dictionary contains the input data (point cloud) under the key specified by self.input_keys[0].
- The second dictionary contains the label (Cd value) under the key specified by self.label_keys[0].
- The third dictionary contains the weight (default is 1) under the key specified by self.weight_keys[0].
Source code in ppsci/data/dataset/drivaernet_dataset.py
DrivAerNetPlusPlusDataset
Bases: Dataset
Paddle Dataset class for the DrivAerNet dataset, handling loading, transforming, and augmenting 3D car models.
This dataset is designed for tasks involving aerodynamic simulations and deep learning models, specifically for predicting aerodynamic coefficients (e.g., Cd values) from 3D car models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Tuple of strings specifying the input keys. These keys correspond to the features extracted from the dataset, typically the 3D vertices of car models. Example: ("vertices",) | required |
| label_keys | Tuple[str, ...] | Tuple of strings specifying the label keys. These keys correspond to the ground-truth labels, such as aerodynamic coefficients (e.g., Cd values). Example: ("cd_value",) | required |
| weight_keys | Tuple[str, ...] | Tuple of strings specifying the weight keys. These keys represent optional weighting factors used during model training to handle class imbalance or sample importance. Example: ("weight_keys",) | required |
| subset_dir | str | Path to the directory containing subsets of the dataset. This directory is used to divide the dataset into different subsets (e.g., train, validation, test) based on provided IDs. | required |
| ids_file | str | Path to the file containing the list of IDs for the subset. The file specifies which models belong to the current subset (e.g., training IDs). | required |
| root_dir | str | Root directory containing the 3D STL files of car models. Each 3D model is expected to be stored in a file named according to its ID. | required |
| csv_file | str | Path to the CSV file containing metadata for the car models. The CSV file includes information such as aerodynamic coefficients, and may also map model IDs to specific attributes. | required |
| num_points | int | Number of points to sample or pad each 3D point cloud to. If the model has more points than | required |
| transform | Optional[Callable] | Optional transformation function applied to each sample. This can include augmentations like scaling, rotation, or jittering. | None |
| pointcloud_exist | bool | Whether the point clouds are pre-processed and saved as | True |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.DrivAerNetPlusPlusDataset(
...     input_keys=("vertices",),
...     label_keys=("cd_value",),
...     weight_keys=("weight_keys",),
...     subset_dir="/path/to/subset_dir",
...     ids_file="train_ids.txt",
...     root_dir="/path/to/DrivAerNetPlusPlusDataset",
...     csv_file="/path/to/aero_metadata.csv",
...     num_points=1024,
...     transform=None,
... )  # doctest: +SKIP
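The num_points parameter is described as the size each point cloud is sampled or padded to. The following is a minimal sketch of that behavior on plain tuples, assuming random subsampling for larger clouds and zero-padding for smaller ones; `sample_or_pad` is a hypothetical helper, and both choices are assumptions rather than the library's confirmed strategy.

```python
import random


def sample_or_pad(points, num_points, seed=0):
    """Return exactly `num_points` points: subsample if too many, zero-pad if too few."""
    rng = random.Random(seed)
    if len(points) >= num_points:
        return rng.sample(points, num_points)
    padding = [(0.0, 0.0, 0.0)] * (num_points - len(points))
    return list(points) + padding


dense_cloud = [(float(i), 0.0, 0.0) for i in range(10)]
sparse_cloud = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

dense_fixed = sample_or_pad(dense_cloud, num_points=4)
sparse_fixed = sample_or_pad(sparse_cloud, num_points=4)
```

Fixing the point count this way lets clouds of different sizes be stacked into one batch.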
Source code in ppsci/data/dataset/drivaernetplusplus_dataset.py
__getitem__(idx, apply_augmentations=True)
Retrieves a sample and its corresponding label from the dataset, with an option to apply augmentations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index of the sample to retrieve. | required |
| apply_augmentations | bool | Whether to apply data augmentations. Defaults to True. | True |
Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray]] | A tuple containing three dictionaries: - The first dictionary contains the input data (point cloud) under the key specified by |
Source code in ppsci/data/dataset/drivaernetplusplus_dataset.py
__len__()
min_max_normalize(data)
Normalizes the data to the range [0, 1] based on min and max values.
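Min-max normalization maps each value v to (v - min) / (max - min). The sketch below shows the arithmetic on a plain list; `min_max_normalize_sketch` is a hypothetical stand-alone function, and its handling of constant input (returning zeros) is an assumption, since the source does not say what the method does in that case.

```python
def min_max_normalize_sketch(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed fallback: a constant input has no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize_sketch([2.0, 4.0, 6.0])
constant = min_max_normalize_sketch([3.0, 3.0])
```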
Source code in ppsci/data/dataset/drivaernetplusplus_dataset.py
ERA5ClimateDataset
Bases: Dataset
ERA5 dataset for multi-meteorological-element climate prediction (r, t, u, v).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path (contains .npy files in year folders). | required |
| input_keys | Tuple[str, ...] | Input dict keys, e.g. ("input",). | required |
| label_keys | Tuple[str, ...] | Label dict keys, e.g. ("output",). | required |
| size | Tuple[int, int] | Crop size (height, width). | required |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| transforms | Optional[Compose] | Optional transforms. Defaults to None. | None |
| training | bool | If True, use training mode (2016-2018); otherwise use validation mode (2019). | True |
| stride | int | Stride for sampling. Defaults to 1. | 1 |
| sq_length | int | Sequence length for input and output. Defaults to 6. | 6 |
| years | Optional[List[str]] | List of years to load. Defaults to None (use default years). | None |
Source code in ppsci/data/dataset/era5climate_dataset.py
ERA5Dataset
Bases: Dataset
Class for ERA5 dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path. | required |
| input_keys | Tuple[str, ...] | Input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("output",). | required |
| precip_file_path | Optional[str] | Precipitation dataset path. Defaults to None. | None |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| vars_channel | Optional[Tuple[int, ...]] | The variable channel index in ERA5 dataset. Defaults to None. | None |
| num_label_timestamps | int | Number of label timestamps. Defaults to 1. | 1 |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
| training | bool | Whether in train mode. Defaults to True. | True |
| stride | int | Stride of sampling data. Defaults to 1. | 1 |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.ERA5Dataset(
...     file_path="/path/to/ERA5Dataset",
...     input_keys=("input",),
...     label_keys=("output",),
... )
Source code in ppsci/data/dataset/era5_dataset.py
ERA5MeteoDataset
Bases: Dataset
ERA5 dataset for multi-meteorological-element prediction (r, t, u, v).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path (contains .npy files in year folders). | required |
| input_keys | Tuple[str, ...] | Input dict keys, e.g. ("input",). | required |
| label_keys | Tuple[str, ...] | Label dict keys, e.g. ("output",). | required |
| size | Tuple[int, int] | Crop size (height, width). | required |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| transforms | Optional[Compose] | Optional transforms. Defaults to None. | None |
| training | bool | If True, use training mode (2016-2018); otherwise use validation mode (2019). | True |
| stride | int | Stride for sampling. Defaults to 1. | 1 |
| sq_length | int | Sequence length for input and output. Defaults to 6. | 6 |
Source code in ppsci/data/dataset/era5meteo_dataset.py
load_data(indices)
Load r, t, u, v for a given index.
Source code in ppsci/data/dataset/era5meteo_dataset.py
ERA5SampledDataset
Bases: Dataset
Class for ERA5 sampled dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path. | required |
| input_keys | Tuple[str, ...] | Input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("output",). | required |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.ERA5SampledDataset(
...     file_path="/path/to/ERA5SampledDataset",
...     input_keys=("input",),
...     label_keys=("output",),
... )
>>> # get the length of the dataset
>>> dataset_size = len(dataset)
>>> # get the first sample of the data
>>> first_sample = dataset[0]
>>> print("First sample:", first_sample)
Source code in ppsci/data/dataset/era5_dataset.py
ERA5SQDataset
Bases: Dataset
Class for ERA5 dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Dataset path. | required |
| input_keys | Tuple[str, ...] | Input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Output keys, such as ("output",). | required |
| weight_dict | Optional[Dict[str, float]] | Weight dictionary. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
| training | bool | Whether in train mode. Defaults to True. | True |
| sq_length | int | Length of sequence for time series data. Defaults to 6. | 6 |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.ERA5SQDataset(
...     file_path="/path/to/ERA5SQDataset",
...     input_keys=("input",),
...     label_keys=("output",),
... )
Source code in ppsci/data/dataset/era5sq_dataset.py
ExtMoEENSODataset
Bases: Dataset
The El Niño/Southern Oscillation dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Name of input keys, such as ("input",). | required |
| label_keys | Tuple[str, ...] | Name of label keys, such as ("output",). | required |
| data_dir | str | The directory of data. | required |
| weight_dict | Optional[Dict[str, Union[Callable, float]]] | Define the weight of each constraint variable. Defaults to None. | None |
| in_len | int | The length of input data. Defaults to 12. | 12 |
| out_len | int | The length of output data. Defaults to 26. | 26 |
| in_stride | int | The stride of input data. Defaults to 1. | 1 |
| out_stride | int | The stride of output data. Defaults to 1. | 1 |
| train_samples_gap | int | The stride of sequence sampling during training. Defaults to 10. For example, with samples_gap = 10 the first sequence contains frame indices [0, 1, ..., T-1] and the second contains [10, 11, ..., T+9]. | 10 |
| eval_samples_gap | int | The stride of sequence sampling during eval. Defaults to 11. | 11 |
| normalize_sst | bool | Whether to use normalization. Defaults to True. | True |
| batch_size | int | Batch size. Defaults to 1. | 1 |
| num_workers | int | The number of workers. Defaults to 1. | 1 |
| training | str | Training phase. Defaults to "train". | 'train' |
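The train_samples_gap description implies that consecutive training sequences start a fixed number of frames apart. The sketch below enumerates those start indices, assuming each sequence has length T = in_len + out_len with unit in/out strides; the helper names and the total frame count of 100 are hypothetical.

```python
def sequence_start_indices(num_frames, seq_len, samples_gap):
    """Start index of every full-length sequence, spaced `samples_gap` frames apart."""
    return list(range(0, num_frames - seq_len + 1, samples_gap))


def sequence_frames(start, seq_len):
    """Frame indices covered by the sequence that begins at `start`."""
    return list(range(start, start + seq_len))


seq_len = 12 + 26  # assumed T = in_len + out_len, using the defaults above
starts = sequence_start_indices(num_frames=100, seq_len=seq_len, samples_gap=10)
second_sequence = sequence_frames(starts[1], seq_len)
```

Under these assumptions the second sequence covers frames 10 through T+9, matching the pattern given in the table.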
Source code in ppsci/data/dataset/ext_moe_enso_dataset.py
IFMMoeDataset
Bases: Dataset
Dataset for IFMMoe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_keys | Tuple[str, ...] | Name of input data. | required |
| label_keys | Tuple[str, ...] | Name of label data. | required |
| data_dir | str | Directory of IFMMoe data. | required |
| data_label | str | IFMMoe data label in tox21/esol/freesolv/lipop... | required |
| data_mode | str | train/val/test mode data. | required |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.IFMMoeDataset(
...     input_keys=("input",),
...     label_keys=("output",),
...     data_dir="/path/to/IFMMoeDataset",
...     data_label="tox21",
...     data_mode="train",
... )
Source code in ppsci/data/dataset/ifm_moe_dataset.py
IterableCSVDataset
Bases: IterableDataset
IterableCSVDataset for full-data loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | CSV file path. | required |
| input_keys | Tuple[str, ...] | List of input keys. | required |
| label_keys | Tuple[str, ...] | List of label keys. | required |
| alias_dict | Optional[Dict[str, str]] | Dict of alias(es) for input and label keys. Defaults to None. | None |
| weight_dict | Optional[Dict[str, Union[Callable, float]]] | Define the weight of each constraint variable. Defaults to None. | None |
| timestamps | Optional[Tuple[float, ...]] | The number of repetitions of the data in the time dimension. Defaults to None. | None |
| transforms | Optional[Compose] | Compose object containing sample-wise transform(s). Defaults to None. | None |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.IterableCSVDataset(
... "/path/to/file.csv",
... ("x",),
... ("u",),
... )
Source code in ppsci/data/dataset/csv_dataset.py
num_samples
property
¶
Number of samples within current dataset.
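The following is a minimal sketch of how key aliasing could work when reading a CSV file. It uses only the standard library and is not the library's implementation; the helper `load_csv_columns` and the column names are hypothetical, and it assumes that an alias maps the key used in code to the column name stored in the file.

```python
import csv
import io


def load_csv_columns(text, keys, alias_dict=None):
    """Read the requested keys from CSV text, resolving aliases first."""
    alias_dict = alias_dict or {}
    rows = list(csv.DictReader(io.StringIO(text)))
    data = {}
    for key in keys:
        # Assumption: the alias maps the key used in code to the file's column name.
        column = alias_dict.get(key, key)
        data[key] = [float(row[column]) for row in rows]
    return data


csv_text = "x_coord,u_value\n0.0,1.0\n0.5,2.0\n"
data = load_csv_columns(csv_text, ("x", "u"), {"x": "x_coord", "u": "u_value"})
print(data)
```

Keys without an alias are looked up in the file under their own name.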
IterableMatDataset
¶
Bases: IterableDataset
IterableMatDataset for full-data loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Mat file path. |
required |
input_keys
|
Tuple[str, ...]
|
List of input keys. |
required |
label_keys
|
Tuple[str, ...]
|
List of label keys. Defaults to (). |
()
|
alias_dict
|
Optional[Dict[str, str]]
|
Dict of alias(es) for input and label keys. i.e. {inner_key: outer_key}. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, Union[Callable, float]]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
timestamps
|
Optional[Tuple[float, ...]]
|
The number of repetitions of the data in the time dimension. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.IterableMatDataset(
... "/path/to/file.mat",
... ("x",),
... ("u",),
... )
Source code in ppsci/data/dataset/mat_dataset.py
num_samples
property
¶
Number of samples within current dataset.
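The `timestamps` parameter is described as repeating the data in the time dimension. One plausible reading, sketched below with plain Python lists, is that every sample is duplicated once per timestamp and a time column is attached. The helper `repeat_over_time` and the `"t"` key are assumptions made for illustration, not the library's confirmed behavior.

```python
def repeat_over_time(data, timestamps):
    """Repeat every sample once per timestamp and attach a "t" column."""
    num_samples = len(next(iter(data.values())))
    repeated = {key: [] for key in data}
    repeated["t"] = []  # hypothetical name for the time column
    for t in timestamps:
        for key, values in data.items():
            repeated[key].extend(values)
        repeated["t"].extend([t] * num_samples)
    return repeated


data = {"x": [0.0, 0.5, 1.0]}
out = repeat_over_time(data, (0.0, 0.1))
print(out)
```

With 3 samples and 2 timestamps the result holds 6 samples.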
IterableNamedArrayDataset
¶
Bases: IterableDataset
IterableNamedArrayDataset for full-data loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input
|
Dict[str, ndarray]
|
Input dict. |
required |
label
|
Optional[Dict[str, ndarray]]
|
Label dict. Defaults to None. |
None
|
weight
|
Optional[Dict[str, ndarray]]
|
Weight dict. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
Examples:
>>> import numpy as np
>>> import ppsci
>>> input = {"x": np.random.randn(100, 1)}
>>> label = {"u": np.random.randn(100, 1)}
>>> weight = {"u": np.random.randn(100, 1)}
>>> dataset = ppsci.data.dataset.IterableNamedArrayDataset(input, label, weight)
Source code in ppsci/data/dataset/array_dataset.py
num_samples
property
¶
Number of samples within current dataset.
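"Full-data loading" suggests that the iterable variant hands over all samples in one step rather than indexing them one by one. The sketch below illustrates that idea with a hypothetical class `FullBatchIterable`; it is an assumption about the behavior, written with plain Python containers instead of arrays.

```python
class FullBatchIterable:
    """Yields the whole (input, label, weight) triple in a single step."""

    def __init__(self, input, label=None, weight=None):
        self.input = input
        self.label = label if label is not None else {}
        self.weight = weight if weight is not None else {}

    @property
    def num_samples(self):
        # Number of samples is taken from the first input entry.
        return len(next(iter(self.input.values())))

    def __iter__(self):
        # A single iteration step returns the full data.
        yield self.input, self.label, self.weight


ds = FullBatchIterable(
    {"x": [[0.0], [1.0], [2.0]]},
    {"u": [[1.0], [2.0], [3.0]]},
)
batches = list(ds)
print(len(batches), ds.num_samples)
```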
IterableNPZDataset
¶
Bases: IterableDataset
IterableNPZDataset for full-data loading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Npz file path. |
required |
input_keys
|
Tuple[str, ...]
|
List of input keys. |
required |
label_keys
|
Tuple[str, ...]
|
List of label keys. Defaults to (). |
()
|
alias_dict
|
Optional[Dict[str, str]]
|
Dict of alias(es) for input and label keys. i.e. {inner_key: outer_key}. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, Union[Callable, float]]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
timestamps
|
Optional[Tuple[float, ...]]
|
The number of repetitions of the data in the time dimension. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.IterableNPZDataset(
... "/path/to/file.npz",
... ("x",),
... ("u",),
... )
Source code in ppsci/data/dataset/npz_dataset.py
num_samples
property
¶
Number of samples within current dataset.
LorenzDataset
¶
Bases: Dataset
Dataset for training Lorenz model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Data set path. |
required |
input_keys
|
Tuple[str, ...]
|
Input keys, such as ("states",). |
required |
label_keys
|
Tuple[str, ...]
|
Output keys, such as ("pred_states", "recover_states"). |
required |
block_size
|
int
|
Data block size. |
required |
stride
|
int
|
Data stride. |
required |
ndata
|
Optional[int]
|
Number of data series to use. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, float]]
|
Weight dictionary. Defaults to None. |
None
|
embedding_model
|
Optional[Arch]
|
Embedding model. Defaults to None. |
None
|
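As a rough illustration of how `block_size` and `stride` interact, the sketch below cuts one series into overlapping blocks. The helper `split_into_blocks` is hypothetical, and the exact slicing used by the dataset may differ.

```python
def split_into_blocks(series, block_size, stride):
    """Cut a series into blocks of length block_size, advancing by stride."""
    blocks = []
    for start in range(0, len(series) - block_size + 1, stride):
        blocks.append(series[start:start + block_size])
    return blocks


series = list(range(10))
blocks = split_into_blocks(series, block_size=4, stride=2)
print(blocks)
```

A stride smaller than the block size produces overlapping blocks; a stride equal to the block size produces disjoint ones.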
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.LorenzDataset(
...     file_path="/path/to/LorenzDataset",
...     input_keys=("x",),
...     label_keys=("v",),
...     block_size=32,
...     stride=16,
... )
Source code in ppsci/data/dataset/trphysx_dataset.py
MatDataset
¶
Bases: Dataset
Dataset class for .mat file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Mat file path. |
required |
input_keys
|
Tuple[str, ...]
|
List of input keys. |
required |
label_keys
|
Tuple[str, ...]
|
List of label keys. Defaults to (). |
()
|
alias_dict
|
Optional[Dict[str, str]]
|
Dict of alias(es) for input and label keys. i.e. {inner_key: outer_key}. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, Union[Callable, float]]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
timestamps
|
Optional[Tuple[float, ...]]
|
The number of repetitions of the data in the time dimension. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.MatDataset(
... "/path/to/file.mat",
... ("x",),
... ("u",),
... )
Source code in ppsci/data/dataset/mat_dataset.py
MeshAirfoilDataset
¶
Bases: Dataset
Dataset for MeshAirfoil.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_keys
|
Tuple[str, ...]
|
Name of input data. |
required |
label_keys
|
Tuple[str, ...]
|
Name of label data. |
required |
data_dir
|
str
|
Directory of MeshAirfoil data. |
required |
mesh_graph_path
|
str
|
Path of mesh graph. |
required |
transpose_edges
|
bool
|
Whether to transpose the edges array from (2, num_edges) to (num_edges, 2) for convenient slicing. |
False
|
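The effect of `transpose_edges` can be pictured with nested lists: a (2, num_edges) layout stores all source nodes in one row and all target nodes in the other, while (num_edges, 2) stores one pair per edge. The helper below is a hypothetical sketch, not the dataset's code.

```python
def transpose_edges(edges):
    """Convert a (2, num_edges) edge layout into (num_edges, 2)."""
    return [list(pair) for pair in zip(*edges)]


edges = [[0, 1, 2], [1, 2, 0]]  # row 0: source nodes, row 1: target nodes
edge_list = transpose_edges(edges)
print(edge_list)
```

In the transposed form, `edge_list[i]` is the i-th edge, which is what makes slicing by edge convenient.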
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.MeshAirfoilDataset(
...     input_keys=("input",),
...     label_keys=("output",),
...     data_dir="/path/to/MeshAirfoilDataset",
...     mesh_graph_path="/path/to/file.su2",
...     transpose_edges=False,
... )
Source code in ppsci/data/dataset/airfoil_dataset.py
MeshCylinderDataset
¶
Bases: Dataset
Dataset for MeshCylinder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_keys
|
Tuple[str, ...]
|
Name of input data. |
required |
label_keys
|
Tuple[str, ...]
|
Name of label data. |
required |
data_dir
|
str
|
Directory of MeshCylinder data. |
required |
mesh_graph_path
|
str
|
Path of mesh graph. |
required |
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.MeshCylinderDataset(
...     input_keys=("input",),
...     label_keys=("output",),
...     data_dir="/path/to/MeshCylinderDataset",
...     mesh_graph_path="/path/to/file.su2",
... )
Source code in ppsci/data/dataset/cylinder_dataset.py
MoleculeDatasetIter
¶
Bases: IterableDataset
Source code in ppsci/data/dataset/synthemol_dataset.py
MOlFLOWDataset
¶
Bases: Dataset
Dataset class for the MoFlow QM9 and ZINC250k datasets, represented as a tuple of datasets.
It combines multiple datasets into one dataset. Each example is represented
by a tuple whose i-th item corresponds to the i-th dataset.
Each i-th dataset is expected to be an instance of numpy.ndarray.
Args:
file_path (str): Dataset path.
data_name (str): Dataset name, "qm9" or "zinc250k".
valid_idx (List[int]): Indices of the data used for validation.
mode (str): "train" or "eval"; determines which data is output.
input_keys (Tuple[str, ...]): Input keys, such as ("nodes", "edges").
label_keys (Tuple[str, ...]): Labels (str, list, or None).
smiles_col (str): Name of the SMILES column.
weight_dict (Optional[Dict[str, Union[Callable, float]]]): Define the weight of each constraint variable. Defaults to None.
transform_fn: An optional function applied to an item before returning.
Source code in ppsci/data/dataset/moflow_dataset.py
load_csv_file(path, name)
¶
Parse a DataFrame using MolGraph and prepare a dataset instance.
Labels are extracted from the label columns, and input features are
extracted from the SMILES strings in the SMILES column.
Source code in ppsci/data/dataset/moflow_dataset.py
NamedArrayDataset
¶
Bases: Dataset
Class for Named Array Dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input
|
Dict[str, ndarray]
|
Input dict. |
required |
label
|
Optional[Dict[str, ndarray]]
|
Label dict. Defaults to None. |
None
|
weight
|
Optional[Dict[str, ndarray]]
|
Weight dict. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
Examples:
>>> import numpy as np
>>> import ppsci
>>> input = {"x": np.random.randn(100, 1)}
>>> output = {"u": np.random.randn(100, 1)}
>>> weight = {"u": np.random.randn(100, 1)}
>>> dataset = ppsci.data.dataset.NamedArrayDataset(input, output, weight)
Source code in ppsci/data/dataset/array_dataset.py
NPZDataset
¶
Bases: Dataset
Dataset class for .npz file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Npz file path. |
required |
input_keys
|
Tuple[str, ...]
|
List of input keys. |
required |
label_keys
|
Tuple[str, ...]
|
List of label keys. Defaults to (). |
()
|
alias_dict
|
Optional[Dict[str, str]]
|
Dict of alias(es) for input and label keys. i.e. {inner_key: outer_key}. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, Union[Callable, float]]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
timestamps
|
Optional[Tuple[float, ...]]
|
The number of repetitions of the data in the time dimension. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
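`weight_dict` accepts either a float or a callable per key. The sketch below shows one way such a specification could be expanded into per-sample weights. The helper `build_weights`, the default weight of 1.0 for unlisted keys, and the per-sample callable signature are all assumptions for illustration, not the library's confirmed behavior.

```python
def build_weights(weight_dict, inputs, label_keys):
    """Expand float or callable weight specs into one weight per sample."""
    num_samples = len(next(iter(inputs.values())))
    weights = {}
    for key in label_keys:
        spec = weight_dict.get(key, 1.0)  # assumed default weight
        if callable(spec):
            # Assumption: the callable receives one sample's inputs as a dict.
            weights[key] = [
                spec({name: values[i] for name, values in inputs.items()})
                for i in range(num_samples)
            ]
        else:
            weights[key] = [float(spec)] * num_samples
    return weights


inputs = {"x": [0.0, 1.0, 2.0]}
weight_dict = {"u": 2.0, "v": lambda sample: 1.0 + sample["x"]}
weights = build_weights(weight_dict, inputs, ("u", "v"))
print(weights)
```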
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.NPZDataset(
... "/path/to/file.npz",
... ("x",),
... ("u",),
... )
Source code in ppsci/data/dataset/npz_dataset.py
PEMSDataset
¶
Bases: Dataset
Dataset class for PEMSD4 and PEMSD8 dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Dataset root path. |
required |
split
|
str
|
Dataset split label. |
required |
input_keys
|
Tuple[str, ...]
|
A tuple of input keys. |
required |
label_keys
|
Tuple[str, ...]
|
A tuple of label keys. |
required |
weight_dict
|
Optional[Dict[str, float]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
transforms
|
Optional[Compose]
|
Compose object contains sample wise transform(s). Defaults to None. |
None
|
norm_input
|
bool
|
Whether to normalize the input. Defaults to True. |
True
|
norm_label
|
bool
|
Whether to normalize the output. Defaults to False. |
False
|
input_len
|
int
|
The input timesteps. Defaults to 12. |
12
|
label_len
|
int
|
The output timesteps. Defaults to 12. |
12
|
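To illustrate `input_len` and `label_len`, the sketch below slices a time series into (input, label) pairs where each label window directly follows its input window. The helper `make_windows` is hypothetical, and the dataset's actual windowing and normalization may differ.

```python
def make_windows(series, input_len, label_len):
    """Build (input, label) pairs of consecutive windows from one series."""
    samples = []
    last_start = len(series) - input_len - label_len
    for start in range(last_start + 1):
        split = start + input_len
        samples.append((series[start:split], series[split:split + label_len]))
    return samples


series = list(range(6))
samples = make_windows(series, input_len=3, label_len=2)
print(samples)
```

With the documented defaults of 12 and 12, each sample would cover 24 consecutive timesteps.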
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.PEMSDataset(
... "./Data/PEMSD4",
... "train",
... ("input",),
... ("label",),
... )
Source code in ppsci/data/dataset/pems_dataset.py
RadarDataset
¶
Bases: Dataset
Class for Radar dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_keys
|
Tuple[str, ...]
|
Input keys, such as ("input",). |
required |
label_keys
|
Tuple[str, ...]
|
Output keys, such as ("output",). |
required |
image_width
|
int
|
Image width. |
required |
image_height
|
int
|
Image height. |
required |
total_length
|
int
|
Total length. |
required |
dataset_path
|
str
|
Dataset path. |
required |
data_type
|
str
|
Input and output data type. Defaults to paddle.get_default_dtype(). |
get_default_dtype()
|
weight_dict
|
Optional[Dict[str, float]]
|
Weight dictionary. Defaults to None. |
None
|
Examples:
>>> import paddle
>>> import ppsci
>>> dataset = ppsci.data.dataset.RadarDataset(
...     input_keys=("input",),
...     label_keys=("output",),
...     image_width=512,
...     image_height=512,
...     total_length=29,
...     dataset_path="datasets/mrms/figure",
...     data_type=paddle.get_default_dtype(),
... )
Source code in ppsci/data/dataset/radar_dataset.py
RosslerDataset
¶
Bases: LorenzDataset
Dataset for training Rossler model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Data set path. |
required |
input_keys
|
Tuple[str, ...]
|
Input keys, such as ("states",). |
required |
label_keys
|
Tuple[str, ...]
|
Output keys, such as ("pred_states", "recover_states"). |
required |
block_size
|
int
|
Data block size. |
required |
stride
|
int
|
Data stride. |
required |
ndata
|
Optional[int]
|
Number of data series to use. Defaults to None. |
None
|
weight_dict
|
Optional[Dict[str, float]]
|
Weight dictionary. Defaults to None. |
None
|
embedding_model
|
Optional[Arch]
|
Embedding model. Defaults to None. |
None
|
Examples:
>>> import ppsci
>>> dataset = ppsci.data.dataset.RosslerDataset(
...     file_path="/path/to/RosslerDataset",
...     input_keys=("x",),
...     label_keys=("v",),
...     block_size=32,
...     stride=16,
... )
Source code in ppsci/data/dataset/trphysx_dataset.py
SEVIRDataset
¶
Bases: Dataset
The Storm EVent ImagRy dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_keys
|
Tuple[str, ...]
|
Name of input keys, such as ("input",). |
required |
label_keys
|
Tuple[str, ...]
|
Name of label keys, such as ("output",). |
required |
data_dir
|
str
|
The path of the dataset. |
required |
weight_dict
|
Optional[Dict[str, Union[Callable, float]]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
data_types
|
Sequence[str]
|
A subset of SEVIR_DATA_TYPES. Defaults to [ "vil", ]. |
['vil']
|
seq_len
|
int
|
The length of the data sequences. Should be smaller than the max length raw_seq_len. Defaults to 49. |
49
|
raw_seq_len
|
int
|
The length of the raw data sequences. Defaults to 49. |
49
|
sample_mode
|
str
|
The sampling mode, e.g. 'random' or 'sequent'. Defaults to "sequent". |
'sequent'
|
stride
|
int
|
Used when sample_mode == 'sequent'. The stride must not be smaller than out_len to prevent data leakage in testing. Defaults to 12. |
12
|
batch_size
|
int
|
The batch size. Defaults to 1. |
1
|
layout
|
str
|
The layout of sampled data, composed of batch_size 'N', seq_len 'T', channel 'C', height 'H', and width 'W'. The raw data layout is 'NHWT'. Valid layouts: 'NHWT', 'NTHW', 'NTCHW', 'TNHW', 'TNCHW'. Defaults to "NHWT". |
'NHWT'
|
in_len
|
int
|
The length of input data. Defaults to 13. |
13
|
out_len
|
int
|
The length of output data. Defaults to 12. |
12
|
num_shard
|
int
|
Split the whole dataset into num_shard parts for distributed training. Defaults to 1. |
1
|
rank
|
int
|
Rank of the current process within num_shard. Defaults to 0. |
0
|
split_mode
|
str
|
If 'ceil', all |
'uneven'
|
start_date
|
datetime
|
Start time of SEVIR samples to generate. Defaults to None. |
None
|
end_date
|
datetime
|
End time of SEVIR samples to generate. Defaults to None. |
None
|
datetime_filter
|
function
|
Mask function applied to the time_utc column of the catalog (return true to keep the row). Pass a function of the form lambda t: COND(t). Example: lambda t: np.logical_and(t.dt.hour>=13,t.dt.hour<=21) # Generate only day-time events. Defaults to None. |
None
|
catalog_filter
|
function
|
Function, None, or 'default'. Mask function applied to the entire catalog dataframe (return true to keep the row). Pass a function of the form lambda catalog: COND(catalog). Example: lambda c: [s[0]=='S' for s in c.id] # Generate only the 'S' events |
'default'
|
shuffle
|
bool
|
If True, data samples are shuffled before each epoch. Defaults to False. |
False
|
shuffle_seed
|
int
|
Seed to use for shuffling. Defaults to 1. |
1
|
output_type
|
dtype
|
The type of generated tensors. Defaults to np.float32. |
float32
|
preprocess
|
bool
|
If True, self.preprocess_data_dict(data_dict) is called before each sample generated. Defaults to True. |
True
|
rescale_method
|
str
|
The rescaling method. Defaults to "01". |
'01'
|
downsample_dict
|
Dict[str, Sequence[int]]
|
downsample_dict.keys() == data_types. downsample_dict[key] is a Sequence of (t_factor, h_factor, w_factor), representing the downsampling factors of all dimensions. Defaults to None. |
None
|
verbose
|
bool
|
Verbose when opening raw data files. Defaults to False. |
False
|
training
|
str
|
Training phase. Defaults to "train". |
'train'
|
Source code in ppsci/data/dataset/sevir_dataset.py
end_event_idx
property
¶
The event idx used in certain rank should satisfy event_idx < end_event_idx
num_event
property
¶
The number of events split into each rank
start_event_idx
property
¶
The event idx used in certain rank should satisfy event_idx >= start_event_idx
total_num_event
property
¶
The total number of events in the whole dataset, before split into different shards.
total_num_seq
property
¶
The total number of sequences within each shard.
Notice that it is not the product of self.num_seq_per_event and self.total_num_event.
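The properties `start_event_idx`, `end_event_idx`, and `num_event` describe a half-open range of events assigned to each rank. The sketch below shows one simple contiguous scheme that spreads the remainder over the first ranks. It is only an assumption for illustration: the helper `shard_range` is hypothetical, and the library's `split_mode` options may divide events differently.

```python
def shard_range(total_num_event, num_shard, rank):
    """Return (start_event_idx, end_event_idx) for one rank, end exclusive."""
    base, extra = divmod(total_num_event, num_shard)
    # The first `extra` ranks each take one additional event.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


ranges = [shard_range(10, 3, rank) for rank in range(3)]
print(ranges)
```

Each rank then uses the events with `start_event_idx <= event_idx < end_event_idx`, and the ranges cover every event exactly once.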
__len__()
¶
close()
¶
data_dict_to_tensor(data_dict, data_types=None)
staticmethod
¶
Convert each element in data_dict to paddle.Tensor (copy without grad).
Source code in ppsci/data/dataset/sevir_dataset.py
downsample_data_dict(data_dict, data_types=None, factors_dict=None, layout='NHWT')
staticmethod
¶
Downsample the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_dict
|
Dict[str, Union[array, Tensor]]
|
The dict of data. |
required |
data_types
|
Optional[Sequence[str]]
|
Data types to be downsampled. Defaults to all keys in |
None
|
factors_dict
|
Optional[Dict[str, Sequence[int]]]
|
each element |
None
|
layout
|
str
|
Layout string, such as "NHWT". |
'NHWT'
|
Returns:
| Name | Type | Description |
|---|---|---|
downsampled_data_dict |
Dict[str, Tensor]
|
The downsampled data. Changes are made on a deep copy of data_dict; the original data_dict is not modified. |
Source code in ppsci/data/dataset/sevir_dataset.py
preprocess_data_dict(data_dict, data_types=None, layout='NHWT', rescale='01')
staticmethod
¶
Preprocess the data dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data_dict
|
Dict[str, Union[ndarray, Tensor]]
|
The dict of data. |
required |
data_types
|
Sequence[str]
|
The data types that we want to rescale. This mainly excludes "mask" from preprocessing. |
None
|
layout
|
str
|
Consists of batch_size 'N', seq_len 'T', channel 'C', height 'H', and width 'W'. |
'NHWT'
|
rescale
|
str
|
'sevir': use the offsets and scale factors in original implementation. '01': scale all values to range 0 to 1, currently only supports 'vil'. |
'01'
|
Returns:
| Name | Type | Description |
|---|---|---|
data_dict |
Dict[str, Union[ndarray, Tensor]]
|
preprocessed data. |
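Layout strings such as 'NHWT' and 'NTHW' name the same axes in different orders, so converting between them amounts to an axis permutation. The sketch below derives that permutation in plain Python. The helper `layout_to_axes` is hypothetical and covers only layouts with identical axis sets; layouts that add a channel axis ('NTCHW', 'TNCHW') would additionally need a new axis inserted.

```python
def layout_to_axes(src_layout, dst_layout):
    """Return the axis order that rearranges src_layout into dst_layout."""
    if sorted(src_layout) != sorted(dst_layout):
        raise ValueError("layouts must contain the same axes")
    return [src_layout.index(axis) for axis in dst_layout]


axes = layout_to_axes("NHWT", "NTHW")
shape = (8, 384, 384, 49)  # illustrative shape in 'NHWT' order
new_shape = tuple(shape[i] for i in axes)
print(axes, new_shape)
```

The resulting axis list can be passed as the permutation argument of an array transpose routine.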
Source code in ppsci/data/dataset/sevir_dataset.py
SphericalSWEDataset
¶
Bases: Dataset
Loads a Spherical Shallow Water equations dataset.
Training contains 200 samples in resolution 32x64. Testing contains 50 samples at resolution 32x64 and 50 samples at resolution 64x128.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input_keys
|
Tuple[str, ...]
|
Input keys, such as ("input",). |
required |
label_keys
|
Tuple[str, ...]
|
Output keys, such as ("output",). |
required |
data_dir
|
str
|
The directory to load data from. |
required |
weight_dict
|
Optional[Dict[str, float]]
|
Define the weight of each constraint variable. Defaults to None. |
None
|
test_resolutions
|
Tuple[str, ...]
|
Resolutions of the test datasets. Defaults to ["34x64", "64x128"]. |
['34x64', '64x128']
|
train_resolution
|
str
|
Resolution of the training dataset. Defaults to "34x64". |
'34x64'
|
data_split
|
str
|
Specify the dataset split, either 'train', 'test_32x64', or 'test_64x128'. Defaults to "train". |
'train'
|
Source code in ppsci/data/dataset/spherical_swe_dataset.py
STAFNetDataset
¶
Bases: Dataset
Dataset class for STAFNet data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_path
|
str
|
Path to the dataset file. |
required |
input_keys
|
Optional[Tuple[str, ...]]
|
Tuple of input keys. Defaults to None. |
None
|
label_keys
|
Optional[Tuple[str, ...]]
|
Tuple of label keys. Defaults to None. |
None
|
seq_len
|
int
|
Sequence length. Defaults to 72. |
72
|
pred_len
|
int
|
Prediction length. Defaults to 48. |
48
|
use_edge_attr
|
bool
|
Whether to use edge attributes. Defaults to True. |
True
|
Examples:
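>>> import ppsci
>>> dataset = ppsci.data.dataset.STAFNetDataset(file_path="/path/to/dataset_file")
>>> # get the length of the dataset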
>>> dataset_size = len(dataset)
>>> # get the first sample of the data
>>> first_sample = dataset[0]
>>> print("First sample:", first_sample)
Source code in ppsci/data/dataset/stafnet_dataset.py
build_dataset(cfg)
¶
Build dataset
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cfg
|
Union[DictConfig, Dataset]
|
Dataset config or dataset. |
required |
Returns:
| Type | Description |
|---|---|
Dataset
|
The built dataset. |
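A config-driven builder of this kind is commonly implemented as a lookup from a class name to a constructor. The sketch below shows that pattern in isolation. `ToyArrayDataset`, `DATASET_REGISTRY`, `build_from_config`, and the `"name"` config key are all hypothetical and only illustrate the idea; they do not describe the actual internals of `build_dataset`.

```python
class ToyArrayDataset:
    """Stand-in dataset used only for this illustration."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)


# Hypothetical registry mapping class names to constructors.
DATASET_REGISTRY = {"ToyArrayDataset": ToyArrayDataset}


def build_from_config(cfg):
    """Build a dataset from a config dict, or pass through an existing dataset."""
    if not isinstance(cfg, dict):
        return cfg  # already a dataset instance
    cfg = dict(cfg)  # copy so the caller's config is not mutated
    name = cfg.pop("name")  # assumed key holding the dataset class name
    return DATASET_REGISTRY[name](**cfg)


ds = build_from_config({"name": "ToyArrayDataset", "values": [1, 2, 3]})
print(len(ds))
```

Passing an already constructed dataset returns it unchanged, which matches the documented `Union[DictConfig, Dataset]` input type.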