CEA, March 2024
Francesco Rizzi
(NexGen Analytics)
nexgenanalytics.github.io/cea-seminar-march-2024/
KokkosSparse::CrsMatrix, parallel_*, and View
github.com/Pressio/SHAW
was accepted to the ECP proxy apps catalog
proxyapps.exascaleproject.org/app/shaw
Many-query problems need surrogate models
e.g. uncertainty quantification (UQ), design optimization
Projection-based reduced-order models (pROMs):
Project the governing equations onto a subspace
Explainable (physics-based), error bounds, full-field predictions
Historically relying on linear subspaces (e.g. POD):
POD: fast to compute, few knobs to tune
Advection/hyperbolic problems need many linear modes
pROMs community sees this as a "limitation/bottleneck"
Can we make virtue out of necessity?
Freytag's pyramid: introduced in 1863 by Gustav Freytag
E.g. Hansel and Gretel (Grimm, 1812)
Introduction
Initial event
Rising action
Climax
Falling action
Resolution
Denouement
Ubiquitous in science and engineering
Parameters, their uncertainties and correlations are critical
Typical parameters: material properties, geometry, boundary conditions (BCs)
Image credits: UT Austin, NASA, web, and the author
Parameter count is (generally) a good indicator of complexity
Source: NVIDIA
Rough tiers: < 10, 10 - 100, >> 100 parameters
Uncertainty Quantification (UQ):
Can we just take any problem and apply UQ?
It depends! How many parameters are too many for UQ?
Any guess?
| # of parameters | Total runs | Total simulation time |
|---|---|---|
| 2 | 25 | ~4 mins |
| 3 | 125 | ~20 mins |
| 5 | 3,125 | ~9 hours |
| 7 | 78,125 | ~9 days |
| 9 | 1,953,125 | ~32 weeks |
| 11 | 48,828,125 | ~15 years |
| 15 | 30,517,578,125 | ~9,000 years |
| 20 | 95,367,431,640,625 | ~3e7 years |
| ... | ... | ... |
Assume: 1 simulation = 10 secs, full tensor grid with 5 points along each parameter axis
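The table follows from simple arithmetic; a minimal sketch that reproduces it (assuming exactly the 10 s/run and 5-points-per-axis figures above):

```cpp
// Sanity check of the table above (assumptions from the slide: 10 s per
// simulation, full tensor grid with 5 points per parameter axis).
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
  const double secsPerRun = 10.0;
  const double ptsPerAxis = 5.0;
  for (int d : {2, 3, 5, 7, 9, 11, 15, 20}) {
    const double runs = std::pow(ptsPerAxis, d);  // 5^d grid points
    const double years = runs * secsPerRun / (3600.0 * 24 * 365);
    std::printf("d = %2d : %.3e runs, %.3e years\n", d, runs, years);
  }
}
```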
Possible counterarguments:
Can we break the trend and make virtue out of using a "large" linear subspace?
1901: PCA invented (Karl Pearson)
1987: Method of POD snapshots (L. Sirovich)
~2017: Manifold learning starts...
Introduction
Initial event
Rising action
Climax
Falling action
Resolution
Denouement
Surface waves: travel at the Earth's surface
Body waves: travel through the Earth
Affected by the material properties (density, modulus)
Primary (P-waves) are compressional, secondary or shear (S-waves) are transversal (particles oscillate perpendicularly to the direction of wave propagation)
Very limited model reduction work exists for this problem
Likely neglected by the pROM community because it is hyperbolic
[Earth cross-section: surface and core-mantle boundaries]
Shear effects are negligible in liquids, so the core is not considered
Given: material properties (density $\rho$, shear modulus $\mu$) and a forcing $f(\mathbf{x}, t)$
Find: the velocity $v(\mathbf{x}, t)$ and shear stresses $\boldsymbol{\sigma}(\mathbf{x}, t)$ satisfying the velocity-stress form of the elastic SH-wave equations*: $\rho\,\partial_t v = \nabla \cdot \boldsymbol{\sigma} + f$, $\;\partial_t \boldsymbol{\sigma} = \mu \nabla v$
* H. Igel, M. Weber, Geophys. Res. Lett. 22 (6) (1995)
Sparse large coefficient matrices
(depend on material properties, not on time)
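A plausible sketch (notation assumed here, not taken from the slides) of the resulting semi-discrete system: after spatial discretization, the velocity and stress dofs satisfy a linear time-invariant system whose sparse blocks couple the two fields:

$$
\frac{d}{dt}
\begin{bmatrix} \boldsymbol{v} \\ \boldsymbol{\sigma} \end{bmatrix}
=
\begin{bmatrix} 0 & \boldsymbol{A}_{v\sigma} \\ \boldsymbol{A}_{\sigma v} & 0 \end{bmatrix}
\begin{bmatrix} \boldsymbol{v} \\ \boldsymbol{\sigma} \end{bmatrix}
+
\begin{bmatrix} \boldsymbol{f} \\ 0 \end{bmatrix}
$$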
Open-source, using Kokkos: proxyapps.exascaleproject.org/app/shaw
Cartoon, not real sparsity pattern!
Contour plots of the velocity field: Ricker wavelet source, T = 60 sec, depth = 640 km
Interference
Reflection
Refraction (from discontinuities)
Time-evolution of the velocity field contours (PREM Earth model) at time = 250, 1000, and 2000 sec
FOM
ROM
For simplicity, assume the same # of modes K for velocity and stresses
Approximations: $\boldsymbol{v} \approx \boldsymbol{\Phi}_v \hat{\boldsymbol{v}}$ and $\boldsymbol{\sigma} \approx \boldsymbol{\Phi}_\sigma \hat{\boldsymbol{\sigma}}$, with each basis holding $K$ POD modes
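Substituting the approximations and projecting onto the (orthonormal) bases yields the standard Galerkin ROM; a sketch in the notation assumed above:

$$
\frac{d\hat{\boldsymbol{v}}}{dt} = \boldsymbol{\Phi}_v^T \boldsymbol{A}_{v\sigma} \boldsymbol{\Phi}_\sigma\, \hat{\boldsymbol{\sigma}} + \boldsymbol{\Phi}_v^T \boldsymbol{f},
\qquad
\frac{d\hat{\boldsymbol{\sigma}}}{dt} = \boldsymbol{\Phi}_\sigma^T \boldsymbol{A}_{\sigma v} \boldsymbol{\Phi}_v\, \hat{\boldsymbol{v}}
$$

The reduced operators are dense $K \times K$ matrices that can be precomputed once, offline.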
Also called the ROM "offline stage"
Execute solves of the FOM for "training" parameter instances
and collect the "snapshots" (mode-1 concatenation)
Snapshot matrix: (# state dofs) x (# time steps)
Identify low-dimensional structure in data (POD)
Factor the (# state dofs) x (# time steps) snapshot matrix via the SVD
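A compact sketch of the POD step (symbols assumed): gather the snapshots column-wise, take a thin SVD, and retain the $K$ leading left singular vectors as the basis:

$$
\boldsymbol{S} = \big[\,\boldsymbol{v}(t_1)\;\cdots\;\boldsymbol{v}(t_{n_t})\,\big] \in \mathbb{R}^{N \times n_t},
\qquad
\boldsymbol{S} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T,
\qquad
\boldsymbol{\Phi} = \boldsymbol{U}(:, 1\!:\!K)
$$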
Many modes needed, as expected
Data-driven interpolation fails
Velocity field at time = 2000 secs, computed for the forcing period T = 69 sec (an extrapolation point)
ROM using 436 modes for velocity and 417 modes for stresses
Introduction
Initial event
Rising action
Climax
Falling action
Resolution
Denouement
| | # of degrees of freedom | Runtime |
|---|---|---|
| Full-order model | ~3,150,000 | t |
| pROM | ~850 | 0.1 t |
| Reduction factor | ~3,700x | 10x |
Introduction
Initial event
Rising action
Climax
Falling action
Resolution
Denouement
How do we evaluate the efficiency of this kernel?
Assume: square system of size $N$, using doubles (8 bytes each)
FLOPS: $2N^2$ (a multiply and an add per matrix entry)
Data movement: $8N^2$ (read $A$) + $8N$ (read $x$) + $8N$ (write $y$) bytes
Result: $2N^2 / (8N^2 + 16N) \approx 1/4$ (flops/byte)
gemv kernel: arithmetic intensity $\approx 1/4$
Memory bandwidth bound!
Roofline model: the theoretically attainable performance as a function of the arithmetic intensity
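In formula form (a standard statement of the roofline model, symbols assumed here): a kernel with arithmetic intensity $I$ can at best achieve

$$
P_{\text{attainable}} = \min\big(P_{\text{peak}},\; I \times B_{\text{mem}}\big)
$$

where $P_{\text{peak}}$ is the peak floating-point rate and $B_{\text{mem}}$ the peak main-memory bandwidth. With $I \approx 1/4$, gemv sits deep in the bandwidth-bound region.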
Modern many-core chips
Best when:
cores are kept busy, data is local
access patterns are optimal for the targeted arch
Standard Galerkin ROM
This is useful when we need many solves, e.g. for UQ
Let's consider M trajectories simultaneously
e.g. different forcing evaluations
Let's put on the UQ and HPC glasses
Arithmetic intensity: ~ K/16 (flops/byte)
This is now a function of K (# of modes)!
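A back-of-the-envelope for where $K/16$ can come from (assumptions mine: $\hat{\boldsymbol{C}} \mathrel{+}= \hat{\boldsymbol{A}}\hat{\boldsymbol{X}}$ with $\hat{\boldsymbol{A}} \in \mathbb{R}^{K \times K}$, $\hat{\boldsymbol{X}} \in \mathbb{R}^{K \times M}$, $\hat{\boldsymbol{C}}$ both read and written, doubles, and $M \approx K$):

$$
I = \frac{2K^2M}{8K^2 + 8KM + 16KM}
\;\approx\;
\frac{2K^3}{32K^2} = \frac{K}{16}
\quad (M \approx K)
$$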
Standard formulation: $\hat{\boldsymbol{y}} = \hat{\boldsymbol{A}}\hat{\boldsymbol{x}}$ (gemv, one trajectory)
Rank-2 formulation: $\hat{\boldsymbol{Y}} = \hat{\boldsymbol{A}}\hat{\boldsymbol{X}}$ (gemm, M trajectories as columns)
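To make the rank-2 idea concrete, here is a minimal sketch, not the SHAW implementation (function and variable names are illustrative assumptions): one explicit Euler step advancing M trajectories at once with a single dense gemm from kokkos-kernels.

```cpp
// Minimal sketch of a rank-2 Galerkin ROM time step (names are illustrative).
#include <Kokkos_Core.hpp>
#include <KokkosBlas3_gemm.hpp>

using matrix_t = Kokkos::View<double**>;

// Ahat : K x K dense reduced operator (precomputed offline)
// x    : K x M reduced states, one column per trajectory
// xnext: K x M output, xnext = x + dt * Ahat * x
void rank2_euler_step(const matrix_t& Ahat, const matrix_t& x,
                      const matrix_t& xnext, double dt) {
  Kokkos::deep_copy(xnext, x);                          // xnext = x
  KokkosBlas::gemm("N", "N", dt, Ahat, x, 1.0, xnext);  // xnext += dt * Ahat * x
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int K = 512, M = 64;  // assumed sizes
    matrix_t Ahat("Ahat", K, K), x("x", K, M), xnext("xnext", K, M);
    rank2_euler_step(Ahat, x, xnext, 0.01);
  }
  Kokkos::finalize();
}
```

The design point: one $K \times K$ by $K \times M$ gemm replaces $M$ separate gemv calls, turning a bandwidth-bound kernel into an increasingly compute-bound one as $K$ and $M$ grow.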
Uses kokkos-kernels with the OpenMP backend;
workstation with two 18-core Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz (24.75MB L3 cache, 125GB total mem)*
M = 1: very limited benefit from threading
M > 1: increasing the # of threads helps
Large K and M are an advantage:
they allow us to fully exploit the machine!
M = # of simultaneous trajectories
M = 1 : standard pROM
M >= 2: rank-2 ROM formulation
* F. Rizzi et al., CMAME, 2021
What combination of thread count (n) and number of simultaneous trajectories (M) is most efficient for obtaining those P samples while satisfying the given constraints?
Suppose:
Launch 36 single-thread ROM runs each using M=1
and repeat until all P samples are done
Launch 18 two-threaded ROM runs each using M=1
and repeat until all P samples are done
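A back-of-the-envelope sketch of the batching arithmetic behind this question (all numbers assumed, not measured): with 36 cores, n threads per run, and M trajectories per run, each "wave" of concurrent runs yields (36/n)·M samples.

```cpp
// Batching arithmetic for the scheduling question above (numbers assumed).
#include <cstdio>

int main() {
  const int cores = 36;  // two 18-core CPUs, as in the workstation above
  const int P = 512;     // assumed total number of samples needed
  struct Choice { int n, M; } choices[] = {{1, 1}, {2, 1}, {4, 8}, {36, 64}};
  for (const auto& c : choices) {
    const int concurrent = cores / c.n;            // runs in flight at once
    const int perWave = concurrent * c.M;          // samples per wave
    const int waves = (P + perWave - 1) / perWave; // waves needed for P samples
    std::printf("n=%2d M=%2d : %2d concurrent runs, %4d samples/wave, %3d waves\n",
                c.n, c.M, concurrent, perWave, waves);
  }
}
```

The actual wall time per wave depends on how the per-run time scales with n and M, which is exactly what the heatmaps below measure.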
If we increase # of modes (K), things improve!
# of modes (K) = 512
# of modes (K) = 2048
Greener is better
Recall that standard ROM is the case M = 1
Inadmissible combinations are excluded
The larger the number of modes, the more efficient it is to evaluate an ensemble of trajectories!
| # of modes (K) | How many times more efficient than rank-1 pROMs? |
|---|---|
| 256 | 13x |
| 512 | 19x |
| 1024 | 23x |
| 2048 | 26x |
MC study: 512 trajectories sampling the forcing period T
Rank-2 ROM is 950 times faster than FOM
If the FOM takes 1 hour, the ROM takes about 4 seconds
Introduction
Initial event
Rising action
Climax
Falling action
Resolution
Denouement
Aeroelasticity:
deforming structures modeled as linear, with a nonlinear load
Acoustic waves:
modeled with a linear PDE, but can have a number of nonlinear sources (turbulent shear layers from wakes)
Neutral particle (neutron, photon, etc.) transport
Linear circuit models
What if the matrix A changes?
What about nonlinear problems?
Tensors are getting more and more attention
ROMs for LTI can benefit from them
Leverage hardware evolution: CUDA tensor cores
Rank-3 ROMs?
Batched gemm?
A compute-bound formulation of Galerkin model reduction for linear time-invariant dynamical systems, F. Rizzi, E. Parish, P. Blonigan, J. Tencer, CMAME, 2021
Eric Parish (SNL)
Patrick Blonigan (SNL)
John Tencer (SNL)
This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This work was funded by the Advanced Simulation and Computing program and the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525
In collaboration with Sandia National Labs.
Questions?
francesco.rizzi@ng-analytics.com
nexgenanalytics.github.io/cea-seminar-march-2024/
If you are here today, you likely use, study, and/or believe in surrogate modeling. I could spend minutes on this, but...
Computing/hardware progresses and changes quickly
Exascale is already here: China has two machines
How does this impact surrogates (if at all)?
Can we/how to leverage this for surrogate modeling?
"It allows me to run my same old surrogate faster": not ideal!
More synergistic development of surrogates and computing?
Source: https://www.alcf.anl.gov/files/DMello-Nguyen-ALCF-CP-Workshop-MKL-2019-05-01-2019.pdf
Historically, a key focus of pROMs work has been:
"finding the smallest subspace that can represent/solve a problem"
Intuitively: small system, more convenient to compute
Mathematically: intriguing but hard
Computationally: is this really the best approach?
What if we can formulate the problem such that we don't need to reduce it so much while being efficient?
This talk aims to provide a counterargument
Focus on pROMs for LTI systems
Weren't they a solved problem...?
Emphasis on computational aspects
Little math; little on error bounds, ML, or deep learning (sorry!)
Disclaimer: this work might seem "obvious" (depending on whom you are talking to)
Format: this is going to be a "story"
Walk through how this work started and developed
Finally, we talk about generalization
Hardware has changed since the '80s!
Visual performance model obtained by plotting attainable performance (in GFLOP/s) against arithmetic intensity
Evaluates resource efficiency by relating an algorithm's arithmetic intensity to the hardware's peak main-memory bandwidth and floating-point performance
Exposes the hardware limitations for a given kernel and helps prioritize optimizations
A quadrant view, with axes Physics/Equations (exact vs. approximate) and Numerics (exact vs. approximate):
Full-order model (FOM): exact physics, exact numerics
pROMs: exact physics, approximate numerics
Reduced physics: approximate physics, exact numerics
Data-driven surrogates: approximate physics, approximate numerics