Optimization of Resource Allocation or Prediction of Minimum Requirements

May 18, 2025 by ADMIN

Introduction

This article discusses how to optimize resource allocation, or predict minimum hardware requirements, for running simulations on a cluster of interconnected computers. We focus on a specific use case: FDS 6.9.0 (Fire Dynamics Simulator) running on Ubuntu across four interconnected computers.

Cluster Configuration

Our cluster consists of four interconnected computers:

Master node: 16 cores; 32GB RAM
Calculation nodes (3 identical machines): 16 cores; 16GB RAM each
Total: 64 cores; 80GB RAM

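The master node is responsible for managing the simulation, while the calculation nodes perform the actual computations. In the run logged in the Appendix, however, the master node (qmaster2) also hosts MPI processes 12 to 15, so all four machines share the computation.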

Simulation Issues

We run into problems with mesh partitioning when a model is divided into only a few meshes. Each mesh then contains too many cells, and the simulation crashes almost immediately after it starts. Through testing, we have found that a single process cannot handle more than approximately 300,000 cells; exceeding this limit causes the simulation to crash within the first few seconds.
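A quick way to catch this before submitting a job is to compute each mesh's cell count from its `IJK` values and compare it against the limit. `IJK` gives the number of cells in x, y and z on each FDS `&MESH` line. The sketch below is a hypothetical helper. The 300,000 threshold is our empirical figure, not a limit defined by FDS, and the example `IJK` triples are made up.

```python
# Hypothetical pre-flight check: flag meshes above our empirical cell limit.
# The 300,000 figure comes from our own tests, not from FDS itself.
CELL_LIMIT = 300_000


def mesh_cells(ijk):
    """Cell count of a mesh from its IJK values (cells in x, y and z)."""
    i, j, k = ijk
    return i * j * k


def oversized_meshes(meshes, limit=CELL_LIMIT):
    """Return (mesh index, cell count) for every mesh above the limit."""
    return [
        (index, mesh_cells(ijk))
        for index, ijk in enumerate(meshes)
        if mesh_cells(ijk) > limit
    ]


if __name__ == "__main__":
    # Made-up IJK triples for illustration only.
    meshes = [(100, 100, 30), (150, 100, 30), (60, 50, 40)]
    for index, cells in oversized_meshes(meshes):
        print(f"mesh {index}: {cells:,} cells exceeds {CELL_LIMIT:,}")
```

With these inputs, only mesh 1 (450,000 cells) is reported. The first mesh sits exactly at the limit and is accepted.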

Workaround

To work around this limitation, we've had to divide our models into many smaller meshes to keep the number of cells per mesh under 300,000. However, this significantly increases the overall simulation time. For instance, a model with 9 million cells had to be split into 32 meshes just to initiate the simulation.
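The arithmetic behind the 9-million-cell example can be checked directly. This minimal sketch assumes the cells can be divided evenly among the meshes. `min_meshes` and `cells_per_mesh` are illustrative names.

```python
def min_meshes(total_cells, limit):
    """Smallest mesh count that keeps every mesh at or below `limit` cells,
    assuming an even split (integer ceiling division)."""
    return (total_cells + limit - 1) // limit


def cells_per_mesh(total_cells, n_meshes):
    """Average cells per mesh for an even split."""
    return total_cells / n_meshes


total = 9_000_000
limit = 300_000
print(min_meshes(total, limit))                # 30
print(cells_per_mesh(total, 32))               # 281250.0
print(1 - cells_per_mesh(total, 32) / limit)   # 0.0625 (headroom below the limit)
```

The strict minimum is 30 meshes. The 32 meshes used in practice put each mesh 6.25% below the limit.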

Recommended Approach

To overcome this limitation, we recommend the following approach:

Optimize mesh partitioning: Balance the cell counts across meshes so that no mesh exceeds the limit, while keeping the total number of meshes as small as possible.
Increase memory: Add memory to the calculation nodes so that each process can handle larger meshes.
Distributed computing (MPI): Spread the meshes over MPI processes running on all nodes simultaneously.
Parallel processing (OpenMP): Use OpenMP threads within each MPI process to make use of all cores on a node.
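One way to make "optimize mesh partitioning" concrete for a single rectangular grid is to search over even splits into px × py × pz equal blocks. The search keeps only splits whose blocks stay under the limit and prefers the fewest blocks. The sketch below does this, with several assumptions:

- The tie-break on internal interface area, a rough proxy for the data exchanged between neighbouring meshes, is our own choice.
- It does not check any further recommendations the FDS User's Guide makes about mesh dimensions.
- `best_split` is a hypothetical helper.
- The 300 × 300 × 100 grid is an example, not one of our models.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def best_split(I, J, K, limit):
    """Hypothetical helper: split an I x J x K cell grid into px*py*pz equal blocks.

    Picks the fewest blocks whose cell count stays <= limit; ties are broken by
    the smallest total internal interface area (in cell faces).
    Returns (px, py, pz, cells_per_mesh), or None if no even split fits.
    """
    best = None
    for px, py, pz in product(divisors(I), divisors(J), divisors(K)):
        cells = (I // px) * (J // py) * (K // pz)
        if cells > limit:
            continue
        # Each cut plane normal to x has J*K faces, normal to y I*K, normal to z I*J.
        interface = (px - 1) * J * K + (py - 1) * I * K + (pz - 1) * I * J
        key = (px * py * pz, interface)
        if best is None or key < best[0]:
            best = (key, (px, py, pz, cells))
    return None if best is None else best[1]


if __name__ == "__main__":
    # Example grid: 300 x 300 x 100 = 9,000,000 cells.
    print(best_split(300, 300, 100, 300_000))
```

For this example, the search settles on 30 meshes of exactly 300,000 cells each, the arithmetic minimum. Because our limit is only approximate, passing a slightly lower `limit` would leave some headroom.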

Configuration Changes

To run simulations more efficiently without excessive mesh division, we recommend the following configuration changes:

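Increase memory: Upgrade the calculation nodes, which have 16GB RAM each compared with 32GB on the master node, so that each process can hold larger meshes.
Tune the MPI/OpenMP split: Keep the number of MPI processes multiplied by the OpenMP threads per process at or below the number of available cores. The run in the Appendix uses 16 processes × 4 threads = 64 threads on 4 × 16 = 64 cores.
Distributed computing: Keep the MPI processes spread evenly over all four nodes, as in the Appendix run (4 processes per node).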
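The configuration can be sanity-checked by comparing, for each node, the threads in use (MPI processes × OpenMP threads) against the cores. The same check reports the RAM available per MPI process. This is a minimal sketch under these assumptions:

- Host names, core counts and RAM are taken from this article and the Appendix log.
- The even layout of 4 processes per node mirrors that log.
- The RAM figure is an upper bound, because the operating system and MPI buffers also need memory.

```python
# Hypothetical layout check for a hybrid MPI + OpenMP run.
# Hardware figures are those of our cluster; host names follow the Appendix log.
NODES = {
    "qmaster2": {"cores": 16, "ram_gb": 32},
    "q01": {"cores": 16, "ram_gb": 16},
    "q02": {"cores": 16, "ram_gb": 16},
    "q03": {"cores": 16, "ram_gb": 16},
}


def check_layout(ranks_per_node, omp_threads, nodes=NODES):
    """Map each node to (threads used, cores, RAM per rank in GB).

    RAM per rank is an upper bound: the OS and MPI buffers also use memory.
    Nodes with no ranks assigned are skipped.
    """
    report = {}
    for name, spec in nodes.items():
        ranks = ranks_per_node.get(name, 0)
        if ranks == 0:
            continue
        report[name] = (ranks * omp_threads, spec["cores"], spec["ram_gb"] / ranks)
    return report


if __name__ == "__main__":
    layout = {name: 4 for name in NODES}  # 16 ranks, 4 per node, as in the log
    for name, (threads, cores, ram) in check_layout(layout, omp_threads=4).items():
        status = "OK" if threads <= cores else "oversubscribed"
        print(f"{name}: {threads}/{cores} threads ({status}), {ram:.1f} GB per rank")
```

With the Appendix layout, every node runs 16 threads on 16 cores. Each process gets at most 4 GB on a calculation node and 8 GB on the master node.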

Estimating Hardware Requirements

To estimate the hardware requirements for models like this, we need to consider the following factors:

Number of cells: The number of cells in the model determines the amount of memory required.
Mesh size: The dimensions and cell size of each mesh determine how many cells it contains.
Simulation time: The simulated duration, and therefore the number of time steps, determines the processing power required.
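The link between mesh size and cell count is simple arithmetic. For uniform cubic cells, each IJK value is the mesh length in that direction divided by the cell size. The sketch below assumes uniform cells and a made-up 10 m × 10 m × 3 m mesh. It also shows why refining the grid is so costly: halving the cell size multiplies the cell count by eight.

```python
def mesh_ijk(length_x, length_y, length_z, cell_size):
    """IJK cell counts for a box of the given lengths (m) and uniform cell size (m).

    round() guards against small floating-point errors (0.3 / 0.1 evaluates to
    2.9999999999999996); lengths are assumed to be whole multiples of the cell size.
    """
    return tuple(round(length / cell_size) for length in (length_x, length_y, length_z))


def total_cells(ijk):
    """Total number of cells for a given IJK triple."""
    i, j, k = ijk
    return i * j * k


if __name__ == "__main__":
    # Hypothetical 10 m x 10 m x 3 m mesh at two resolutions.
    for dx in (0.1, 0.05):
        ijk = mesh_ijk(10.0, 10.0, 3.0, dx)
        print(dx, ijk, total_cells(ijk))
```

At 0.1 m, the mesh has exactly 300,000 cells, right at our empirical limit. At 0.05 m, it grows to 2,400,000 cells.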

Based on our experience, we estimate that a model with 9 million cells requires:

16 cores: to process the large number of cells in parallel.
32GB RAM: to hold the model's data in memory.
Distributed computing: To run the simulation on multiple nodes simultaneously.
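The 32GB figure can be turned into a rough per-mesh budget if memory is assumed to scale linearly with cell count. That linearity is an assumption made for this sketch, not a property we have measured. `scaled_memory_gb` is a hypothetical name.

```python
def scaled_memory_gb(ref_ram_gb, ref_cells, cells):
    """Memory for `cells`, scaled linearly from a reference estimate.

    Assumes memory use is proportional to cell count (an approximation).
    """
    return ref_ram_gb * cells / ref_cells


# Reference point from our estimate: 32 GB for 9 million cells.
REF_RAM_GB = 32
REF_CELLS = 9_000_000

if __name__ == "__main__":
    for mesh in (281_250, 300_000):  # 32-mesh split and the empirical limit
        print(mesh, round(scaled_memory_gb(REF_RAM_GB, REF_CELLS, mesh), 2))
```

Under this assumption, each of the 32 meshes in our 9-million-cell example needs about 1 GB, and a mesh at the 300,000-cell limit needs about 1.07 GB. These values can be compared with the RAM available per MPI process on the node that hosts the mesh.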

Conclusion

In conclusion, optimizing resource allocation, or predicting minimum requirements, is crucial for running simulations on a cluster of interconnected computers. By optimizing mesh partitioning, increasing memory, and making better use of distributed computing and parallel processing, we can overcome the limitations of our current setup and run simulations more efficiently. We estimate that a model with 9 million cells requires 16 cores, 32GB RAM, and distributed computing to run efficiently.

Future Work

In the future, we plan to put the measures from the Recommended Approach into practice: better-balanced mesh partitioning, more memory on the calculation nodes, and fuller use of distributed (MPI) and parallel (OpenMP) processing.

By following these recommendations and future work plans, we can improve the efficiency of our simulations and run larger models without splitting them into an excessive number of meshes.

References

[1] FDS 6.9.0 User Manual.
[2] OpenMP 4.5 User Manual.
[3] Distributed Computing with MPI.

Appendix

The following is the startup log of a run that crashed:

2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Starting FDS ...
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  MPI Process      0 started on q01
2025-05-07 10:19:13 :  MPI Process     12 started on qmaster2
2025-05-07 10:19:13 :  MPI Process      4 started on q02
2025-05-07 10:19:13 :  MPI Process      8 started on q03
2025-05-07 10:19:13 :  MPI Process      1 started on q01
2025-05-07 10:19:13 :  MPI Process     13 started on qmaster2
2025-05-07 10:19:13 :  MPI Process     14 started on qmaster2
2025-05-07 10:19:13 :  MPI Process      6 started on q02
2025-05-07 10:19:13 :  MPI Process     10 started on q03
2025-05-07 10:19:13 :  MPI Process      5 started on q02
2025-05-07 10:19:13 :  MPI Process      3 started on q01
2025-05-07 10:19:13 :  MPI Process     15 started on qmaster2
2025-05-07 10:19:13 :  MPI Process      2 started on q01
2025-05-07 10:19:13 :  MPI Process      9 started on q03
2025-05-07 10:19:13 :  MPI Process      7 started on q02
2025-05-07 10:19:13 :  MPI Process     11 started on q03
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Reading FDS input file ...
2025-05-07 10:19:13 :
2025-05-07 10:19:13 : WARNING: SPEC SFPE POLYURETHANEM27_fuel is not in the table of pre-defined species. Any unassigned SPEC variables in the input were assigned the properties of nitrogen.
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Fire Dynamics Simulator
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Current Date     : May  7, 2025  10:19:13
2025-05-07 10:19:13 :  Revision         : FDS-6.9.0-0-g6339569-release
2025-05-07 10:19:13 :  Revision Date    : Wed Mar 20 13:59:17 2024 -0400
2025-05-07 10:19:13 :  Compiler         : Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.1 Build 20221019_000000
2025-05-07 10:19:13 :  Compilation Date : Mar 21, 2024 07:58:03
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Number of MPI Processes:  16
2025-05-07 10:19:13 :  Number of OpenMP Threads: 4
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  MPI version: 3.1
2025-05-07 10:19:13 :  MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :
2025-05-07 10:19:13 :  Job TITLE        :
2025-05-07 10:19:13 :  Job ID string    : test
2025-05-07 10:19:13 :
2025-05-07 10:19:39 :  Time Step:      1, Simulation Time:      0.09 s
2025-05-07 10:19:43 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-05-07 10:19:43 : Image              PC                Routine            Line        Source            
2025-05-07 10:19:43 : libc.so.6          0000745424042520  Unknown               Unknown  Unknown
2025-05-07 10:19:43 : fds_openmp         00000000073A5394  Unknown               Unknown  Unknown
2025-05-07 10:19:43 : fds_openmp         000000000717EAE5  Unknown               Unknown

**Optimization of Resource Allocation or Prediction of Minimum Requirements: Q&A**
================================================================================

**Q: What is the main issue with running simulations on our cluster?**
----------------------------------------------------------------

A: When a model is divided into only a few meshes, each mesh contains too many cells, and the simulation crashes right after it starts. In our tests, a single process cannot handle more than about 300,000 cells.

**Q: What is the recommended approach to overcome this limitation?**
----------------------------------------------------------------

A: The recommended approach is to optimize mesh partitioning, increase memory, and use distributed computing and parallel processing to run the simulation on multiple nodes simultaneously.

**Q: How can we optimize mesh partitioning?**
--------------------------------------------

A: By balancing the cell counts across meshes so that none exceeds the limit of about 300,000 cells, while keeping the total number of meshes as small as possible.

**Q: What is the estimated hardware requirement for a model with 9 million cells?**
--------------------------------------------------------------------------------

A: The estimated hardware requirement is 16 cores, 32GB RAM, and distributed computing to run the simulation on multiple nodes simultaneously.

**Q: What is the role of the master node in the cluster?**
---------------------------------------------------

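A: The master node is responsible for managing the simulation, while the calculation nodes perform the actual computations. In the run logged in the appendix, however, the master node (qmaster2) also runs MPI processes 12 to 15.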

**Q: What is the role of the calculation nodes in the cluster?**
---------------------------------------------------------

A: The calculation nodes perform the actual computations and handle the large number of cells.

**Q: What is the recommended configuration change to run simulations more efficiently?**
-----------------------------------------------------------------------------------

A: The recommended configuration changes are to increase memory, optimize the OpenMP settings, and use distributed computing to run the simulation on multiple nodes simultaneously.

**Q: What is the estimated simulation time for a model with 9 million cells?**
--------------------------------------------------------------------------------

A: This article does not give a specific figure. The simulation time increases significantly because the model must be split into many smaller meshes (32 for the 9-million-cell model) to keep each mesh under 300,000 cells.

**Q: What is the future work plan to improve the efficiency of simulations?**
--------------------------------------------------------------------------------

A: The future work plan includes optimizing mesh partitioning, increasing memory, and using distributed computing and parallel processing to run the simulation on multiple nodes simultaneously.

**Q: What are the references used in this article?**
---------------------------------------------------

A: The references used in this article are the FDS 6.9.0 User Manual, OpenMP 4.5 User Manual, and Distributed Computing with MPI.

**Q: What is the log of the starting process?**
------------------------------------------------

A: The startup log, which ends with the crash, is provided in the appendix of this article.

**Q: What is the error message displayed when the simulation crashes?**
----------------------------------------------------------------

A: The run aborts with `forrtl: severe (174): SIGSEGV, segmentation fault occurred`, shortly after the first time step. In our tests, this crash occurs whenever a mesh exceeds approximately 300,000 cells.
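A log like the one in the appendix can be scanned for this crash with regular expressions, which also confirms how the MPI processes were placed. This sketch assumes the line formats shown in the appendix: `MPI Process N started on HOST`, `Number of OpenMP Threads: N`, and the `SIGSEGV` message. The sample text is copied from that log, and `scan_log` is a hypothetical name.

```python
import re
from collections import Counter

RANK_RE = re.compile(r"MPI Process\s+(\d+)\s+started on\s+(\S+)")
THREADS_RE = re.compile(r"Number of OpenMP Threads:\s*(\d+)")


def scan_log(text):
    """Return (MPI ranks per host, OpenMP threads per rank, crashed?)."""
    ranks_per_host = Counter(host for _rank, host in RANK_RE.findall(text))
    match = THREADS_RE.search(text)
    threads = int(match.group(1)) if match else None
    crashed = "SIGSEGV" in text
    return ranks_per_host, threads, crashed


# A few lines copied from the appendix log.
SAMPLE = """\
2025-05-07 10:19:13 :  MPI Process      0 started on q01
2025-05-07 10:19:13 :  MPI Process     12 started on qmaster2
2025-05-07 10:19:13 :  MPI Process      4 started on q02
2025-05-07 10:19:13 :  MPI Process      1 started on q01
2025-05-07 10:19:13 :  Number of OpenMP Threads: 4
2025-05-07 10:19:43 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
"""

if __name__ == "__main__":
    hosts, threads, crashed = scan_log(SAMPLE)
    for host, ranks in sorted(hosts.items()):
        print(f"{host}: {ranks} rank(s) x {threads} thread(s)")
    print("crashed:", crashed)
```

Run on the full appendix log, it reports four processes on each of q01, q02, q03 and qmaster2, each with 4 OpenMP threads (16 threads per 16-core node). It also flags the crash.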

**Q: What is the recommended approach to handle the error message?**
----------------------------------------------------------------

A: The recommended approach is to optimize mesh partitioning, increase memory, and use distributed computing and parallel processing to run the simulation on multiple nodes simultaneously.

**Q: What is the estimated cost of running simulations on the cluster?**
-------------------------------------------------------------------

A: The cost of running simulations on the cluster is not estimated in this article, but it is expected to be high because of the large amounts of memory and processing power required.

**Q: What is the future work plan to reduce the cost of running simulations?**
--------------------------------------------------------------------------------

A: The future work plan includes optimizing mesh partitioning, increasing memory, and using distributed computing and parallel processing to run the simulation on multiple nodes simultaneously. This is expected to reduce the cost of running simulations by shortening run times.