November 27, 2006
The HPC4U European research project, active in grid computing technologies, has just released the first freeware version of its grid middleware, which provides fault tolerance for parallel applications. The system, based on a Linux kernel running as an MS Windows service (coLinux), lets users launch parallel applications on virtual nodes in order to see the fault tolerance mechanisms in action. (A user can start a parallel compute job on two compute nodes, kill one of those nodes, and see the job restart on two other nodes within a second.)
HPC4U's freeware version uses a coLinux system. coLinux is a virtualisation approach that, in contrast to systems such as VMware, does not emulate an entire machine but runs the Linux kernel as an MS Windows service. Using coLinux makes the system easier to run, since the operating system is booted from a CD-ROM or DVD without any prior installation on the computer's disk. This coLinux-based system uses CCS together with two free and open source components that offer basic fault tolerance mechanisms for parallel applications: BLCR (Berkeley Lab Checkpoint/Restart) and LAM/MPI.
BLCR allows programs running on Linux to be "checkpointed" (written entirely to a file) and later "restarted." BLCR performs checkpointing and restarting inside the Linux kernel. While this makes it less portable than solutions that use user-level libraries, it also means that BLCR has full access to all kernel resources and can therefore restore resources (such as process IDs) that user-level libraries cannot. In the future, this will also allow BLCR to checkpoint and restart entire sessions and/or process groups (such as shell scripts and their subprocesses).
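The contrast with user-level checkpointing can be illustrated with a minimal Python sketch. It is not BLCR and does not use any BLCR interface: it is a hypothetical user-level scheme in which a toy computation saves its own state to a file and resumes from it. All names (`run`, `save_checkpoint`, `load_checkpoint`, the file name `job.ckpt`) are assumptions made for illustration. Note that the process ID recorded at checkpoint time can only be read back, not restored, which is the kind of resource the paragraph above says a kernel-level approach can recover.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the application state to a file (user-level checkpoint)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        # The PID is recorded for reference only; user-level code cannot
        # make a restarted process take over this PID.
        pickle.dump({"pid": os.getpid(), "state": state}, f)
    # Rename atomically so a crash mid-write never leaves a torn checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Read back the most recent checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(state, steps, checkpoint_path=None, checkpoint_every=0):
    """Toy job: keep a running sum of 0, 1, 2, ... and checkpoint periodically."""
    for _ in range(steps):
        state["total"] += state["i"]
        state["i"] += 1
        if checkpoint_path and checkpoint_every and state["i"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    # First run does half the work, then is assumed to be killed.
    run({"i": 0, "total": 0}, steps=5, checkpoint_path=path, checkpoint_every=5)
    # "Restart": a fresh run resumes from the file instead of from zero.
    image = load_checkpoint(path)
    final = run(image["state"], steps=5)
    print(final)
```

In this sketch the application has to cooperate by exposing its state as a picklable object; a kernel-level mechanism of the kind described above checkpoints the process as a whole instead.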
LAM/MPI is an open-source implementation of the Message Passing Interface specification, including all of MPI-1.2 and much of MPI-2. One of the main advantages of using LAM/MPI in the HPC4U freeware bundle is its native compatibility with BLCR. As detailed on the LAM/MPI website, MPI applications running under LAM/MPI can be checkpointed to disk and restarted at a later time. LAM requires a third-party single-process checkpoint/restart toolkit to actually checkpoint and restart a single MPI process; LAM itself handles the parallel coordination.
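The division of labour described above, with one layer coordinating the parallel job and a separate toolkit checkpointing each individual process, can be sketched in plain Python. This is a simulation under stated assumptions, not LAM/MPI's actual protocol or API: `Rank`, `Coordinator`, `single_process_checkpoint`, and `single_process_restart` are hypothetical names, and the "quiesce by draining in-flight messages" step is one common way to obtain a consistent snapshot, assumed here for illustration.

```python
import copy


class Rank:
    """One simulated parallel process: local state plus undelivered messages."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"received": 0}
        self.inbox = []

    def send(self, other, value):
        # The message is "in flight" until the receiver drains its inbox.
        other.inbox.append(value)

    def drain(self):
        # Deliver everything in flight so no message is lost by the snapshot.
        while self.inbox:
            self.state["received"] += self.inbox.pop(0)


def single_process_checkpoint(rank):
    """Stand-in for a single-process toolkit: snapshot one process."""
    return copy.deepcopy(rank.state)


def single_process_restart(rank, image):
    """Stand-in for a single-process toolkit: restore one process."""
    rank.state = copy.deepcopy(image)
    rank.inbox = []


class Coordinator:
    """Handles only the parallel coordination; per-process work is delegated."""

    def __init__(self, ranks, checkpoint_fn, restart_fn):
        self.ranks = ranks
        self.checkpoint_fn = checkpoint_fn
        self.restart_fn = restart_fn

    def checkpoint(self):
        # Phase 1: quiesce every rank so the global state is consistent.
        for rank in self.ranks:
            rank.drain()
        # Phase 2: delegate each per-process snapshot to the plugged-in toolkit.
        return {rank.rank_id: self.checkpoint_fn(rank) for rank in self.ranks}

    def restart(self, images):
        for rank in self.ranks:
            self.restart_fn(rank, images[rank.rank_id])


if __name__ == "__main__":
    ranks = [Rank(0), Rank(1)]
    coord = Coordinator(ranks, single_process_checkpoint, single_process_restart)
    ranks[0].send(ranks[1], 3)
    images = coord.checkpoint()
    ranks[1].state["received"] = 99  # progress made after the checkpoint is lost
    coord.restart(images)
    print(ranks[1].state)
```

Passing the per-process functions into the coordinator mirrors the plug-in relationship the paragraph describes: the coordinator does not need to know how a single process is saved or restored.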
The combination of these free and open source components with CCS, the resource management system developed by UPB and used by HPC4U, makes it possible to test HPC4U's basic functionality. Users simply boot their computers from the provided DVD, temporarily turning them into compute nodes, and can then test the fault tolerance mechanisms on a given application.