EPFL-SCR No 12
Jean M. Favre, Sathya Krishnamurthy, CSCS, Swiss Center for Scientific Computing
Visualization tools and environments for very large data volumes
The ease with which computational scientists and experimentalists can generate data has often outpaced the capacity of the post-processing tools available in High Performance Computing environments. This is changing, however, as scientific visualization is itself becoming an HPC activity. In this article, we review systems and techniques that are becoming feasible for the ingestion and graphical display of very large data sets. Based on our experience and a review of current practices, we attempt to describe the best visualization methods. From data management issues on the file servers, to information extraction, to data processing and advanced 3D graphics, we propose a journey through the data production and visualization chain, reviewing some of the latest technologies available.
Data visualization is reaching an all-time high. How many times have we heard this statement in the last few years? The year 2000 is already marked by the first publicized 11.5 billion cell visualization [1], and the trend shows no signs of slowing down. Large-scale environments that attempt to optimize the use of resources at all levels of the data visualization chain are supplanting the traditional desktop graphics workstation. File access and data retrieval are the first essential links of this infrastructure. When a dataset does not fit in local memory, or in the available local disk space, a set of non-trivial techniques must be put in place to accelerate its retrieval. Second comes information extraction, the essence of data visualization. Parallel or distributed implementations are now possible, paving the way towards computational-steering architectures.
Parallelism has also moved to the graphics hardware, with multi-pipe rendering servers and large displays. These computational servers can offer a tight coupling between simulation and visualization, or they can be used to generate remote graphics, leaving little work on the client side other than the display of static images. Offering interactivity through remote rendering requires a minimum of 10 frames per second, potentially using over 30 Mbytes per second of bandwidth for full-screen uncompressed images. Generally though, visualization must provide the interactive means to explore data and their representations. To accelerate this process, we will see how improvements in 3D graphics technology are contributing to handling objects made of millions of graphics primitives.
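The bandwidth figure quoted above for remote rendering can be checked with a short calculation. The sketch below assumes a 1280 x 1024 display and 24-bit RGB pixels; neither value is stated in the text, so the result is only indicative.

```python
def remote_rendering_bandwidth(width, height, bytes_per_pixel, frames_per_second):
    """Return the uncompressed image stream rate in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_second

# Assumed display (not from the article): 1280 x 1024 pixels, 3 bytes per pixel.
# The frame rate is the 10 frames per second minimum quoted in the text.
rate = remote_rendering_bandwidth(1280, 1024, 3, 10)
rate_mbytes = rate / 1e6
print(f"{rate_mbytes:.1f} Mbytes/s")
```

Under these assumptions the stream needs roughly 39 Mbytes per second, which is consistent with the "over 30 Mbytes per second" order of magnitude.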
The images and movies produced during visualization remain static snapshots of the dataset under study. A further added value of a visualization environment is its ability to reproduce results and to facilitate side-by-side comparisons. To this end, we will see the role played by scripting languages in visualization. They provide the programs needed to make the deployment of a complete visualization chain less tedious.
The need for some form of Data Management for efficient data access becomes obvious when dealing with datasets beyond a few hundred megabytes. Data access should be optimized on a variable basis, on a computational block basis, and at specific time-steps for transient solutions analysis. Such access should be provided without the need for searching. It should provide support for self-descriptive and context-free data manipulation. For example, MemCom [2] is a data management system that has been specifically designed for engineering applications, like computational solid mechanics, computational fluid dynamics, and coupled multi-disciplinary applications. MemCom consists of a wide range of functions for data definition and data manipulation, as well as auxiliary tools. The data manipulation functions are not tied to specific applications, and APIs exist for C, C++, and Fortran, as well as a CORBA interface. Access to large collections of data has been optimized in MemCom, and it fits the concept of load-on-demand very well.
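Access by block, variable, and time-step without searching can be illustrated with a small index-based store. This is a minimal sketch and not MemCom's API: the functions `write_database` and `load_on_demand`, the key layout, and the in-memory byte stream standing in for a file are all assumptions made for illustration.

```python
import io
import struct

def write_database(records):
    """Pack float arrays into one byte stream and build a lookup index.

    records maps (block, variable, time_step) -> list of floats.
    Returns (stream, index), where index maps each key to (offset, count).
    """
    stream = io.BytesIO()
    index = {}
    for key, values in records.items():
        index[key] = (stream.tell(), len(values))
        stream.write(struct.pack(f"<{len(values)}d", *values))
    return stream, index

def load_on_demand(stream, index, block, variable, time_step):
    """Read only the requested array: one index lookup, one seek, one read."""
    offset, count = index[(block, variable, time_step)]
    stream.seek(offset)
    return list(struct.unpack(f"<{count}d", stream.read(8 * count)))

# Hypothetical content: two blocks, two variables, two time-steps.
records = {
    (1, "pressure", 0): [1.0, 2.0, 3.0],
    (1, "pressure", 1): [1.5, 2.5, 3.5],
    (2, "density", 0): [0.9, 0.8],
}
stream, index = write_database(records)
density = load_on_demand(stream, index, block=2, variable="density", time_step=0)
```

Only the bytes of the requested array are read; the rest of the stream is never touched, which is the essence of load-on-demand.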
A user-defined reader providing a native interface to MemCom databases for the EnSight software (www.ensight.com) was created at the Swiss Center for Scientific Computing (CSCS) [3]. It took advantage of the explicit and clear hierarchy of template sub-routines specified by the EnSight environment, which matches extremely well the part, block, variable, and time-step access sub-routines offered by MemCom.
Data parallelism is essential for concurrently processing independent subsets of data. In Vtk, the implementation of data parallelism does not require any additional changes to the toolkit. To write a program that expresses data parallelism:
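As a toolkit-independent illustration of the idea (this is not Vtk code, and every name in it is hypothetical), the sketch below splits a dataset into independent pieces, applies the same filter to each piece, and concatenates the partial results. A thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_pieces(cells, number_of_pieces):
    """Divide a list of cells into contiguous, nearly equal, independent pieces."""
    size, extra = divmod(len(cells), number_of_pieces)
    pieces, start = [], 0
    for i in range(number_of_pieces):
        stop = start + size + (1 if i < extra else 0)
        pieces.append(cells[start:stop])
        start = stop
    return pieces

def extract_above_threshold(piece, threshold=0.5):
    """Stand-in for a visualization filter that needs one piece only."""
    return [value for value in piece if value > threshold]

cells = [i / 10 for i in range(10)]      # scalar values 0.0 .. 0.9
pieces = split_into_pieces(cells, 3)

# Each piece is processed independently; results are concatenated in order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(extract_above_threshold, pieces))
result = [value for part in partial_results for value in part]
```

Because the filter never looks outside its own piece, the combined result is the same as running the filter on the whole dataset.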
Client-server applications are the first examples of distributed computing. They are generally found under the following three scenarios.
In another development, the Vtk visualization library has also been extended to support multiple processes [9]. A system process object encodes whether the system is distributed (via MPI) or shared-memory (via pthreads or sprocs). In many recent visualization software packages, execution is based on the data-flow approach. To make parsimonious use of computing resources when confronted with large data, it is then recommended to drive the execution in an event-driven fashion. Multiple visualization parameters and queries can then be grouped before requesting the data required. This is in contrast to demand-driven environments, whose performance under heavy loads suffers from too many update requests.
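The grouping of parameter changes before a data request can be sketched as follows. The class `BatchedPipeline` and its methods are hypothetical and do not correspond to any particular visualization package; the point is only that several changes trigger a single expensive request.

```python
class BatchedPipeline:
    """Toy pipeline that groups parameter changes before touching the data."""

    def __init__(self, load_data):
        self.load_data = load_data      # expensive data request (hypothetical)
        self.pending = {}               # parameter changes not yet applied
        self.parameters = {}
        self.data_requests = 0

    def set_parameter(self, name, value):
        """Record a change; no data is requested yet."""
        self.pending[name] = value

    def update(self):
        """Apply all grouped changes with a single data request."""
        if not self.pending:
            return None
        self.parameters.update(self.pending)
        self.pending.clear()
        self.data_requests += 1
        return self.load_data(dict(self.parameters))

# The loader here just reports which parameters it was asked about.
pipeline = BatchedPipeline(load_data=lambda params: sorted(params))
pipeline.set_parameter("isovalue", 0.3)
pipeline.set_parameter("variable", "pressure")
pipeline.set_parameter("time_step", 12)
requested = pipeline.update()
```

Three parameter changes result in one data request, whereas an environment that updates after every change would issue three.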
Visualization is traditionally achieved by the creation of geometric representations, or of pixel-based imagery. Many opportunities exist to optimize this graphical data display. OpenGL now offers many features useful for the display of engineering data. Vertex arrays, for example, are a recent feature (OGL 1.1) that allows entire lists of primitives (triangles or quadrilaterals, for example) to be sent with a single call to the rendering API, specifying at once all the pointers where coordinates, colors, and textures can be found in block-optimized regions of memory on the client side. Other ways to minimize data exchange between the OpenGL client and its server process are display lists and texture objects.
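To give a feel for the saving, the sketch below packs triangles into contiguous coordinate and color blocks, as vertex arrays expect, and compares rough API call counts. It does not call OpenGL; the data layout and the per-vertex call pattern (one `glColor3f` and one `glVertex3f` per vertex in immediate mode) are assumptions chosen for illustration.

```python
from array import array

def pack_vertex_arrays(triangles):
    """Flatten triangles into contiguous coordinate and color blocks.

    triangles is a list of 3-vertex tuples; each vertex is
    ((x, y, z), (r, g, b)).
    """
    coordinates = array("f")
    colors = array("f")
    for triangle in triangles:
        for position, color in triangle:
            coordinates.extend(position)
            colors.extend(color)
    return coordinates, colors

def immediate_mode_calls(triangle_count):
    """glBegin + (glColor3f + glVertex3f) per vertex + glEnd."""
    return 1 + 2 * 3 * triangle_count + 1

def vertex_array_calls():
    """2 x glEnableClientState + glVertexPointer + glColorPointer + glDrawArrays."""
    return 5

triangle = (((0, 0, 0), (1, 0, 0)), ((1, 0, 0), (0, 1, 0)), ((0, 1, 0), (0, 0, 1)))
coordinates, colors = pack_vertex_arrays([triangle] * 1000)
print(immediate_mode_calls(1000), "calls versus", vertex_array_calls())
```

With vertex arrays the number of calls no longer grows with the number of primitives; only the size of the memory blocks does.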
To improve interactivity, one may also use different levels of detail (LOD): bounding boxes, clouds of points, surface decimations with a smaller number of primitive cells, or higher-order representations such as parametric surface hulls for interactive display, switching then to full-resolution graphics for static image production.
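The switching logic described above can be sketched in a few lines. The level names follow the text, but the primitive counts and the notion of a per-frame "primitive budget" are invented for the example.

```python
# Representations ordered from coarsest to finest; primitive counts are made up.
LEVELS_OF_DETAIL = [
    ("bounding box", 12),
    ("point cloud", 50_000),
    ("decimated surface", 400_000),
    ("full resolution", 4_000_000),
]

def choose_level(interacting, primitive_budget):
    """Pick the finest level within the budget while interacting,
    and full resolution for static image production."""
    if not interacting:
        return LEVELS_OF_DETAIL[-1][0]
    chosen = LEVELS_OF_DETAIL[0][0]     # the coarsest level is always affordable
    for name, primitive_count in LEVELS_OF_DETAIL:
        if primitive_count <= primitive_budget:
            chosen = name
    return chosen

while_rotating = choose_level(interacting=True, primitive_budget=500_000)
final_image = choose_level(interacting=False, primitive_budget=500_000)
```

In practice the budget would be derived from the frame rate the hardware can sustain; here it is simply passed in as a number.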
The computational demands of volume rendering require the use of a high degree of hardware parallelism. In addition, volume rendering is a memory-intensive operation; the design of the memory system is critical in volume rendering architectures. Texture mapping hardware, which is a common feature of modern 3D graphics accelerators, can be exploited for volume rendering by applying a method called planar texture resampling. The volume is stored in 3D texture memory and resampled during rendering by extracting textured planes parallel to the image plane. Lookup tables map density to RGBA color and opacity. The resulting texture images are combined in back-to-front visibility order using compositing. There are certain limitations that are encountered while going from the 2D-texture to the 3D-texture approach:
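The lookup-table classification and back-to-front compositing steps mentioned above can be sketched in software for a single viewing ray. This is a minimal illustration, not the hardware texture path: the two-entry lookup table and the sample values are assumptions, and the blend shown is the standard "over" operator.

```python
def classify(density, lookup_table):
    """Map a density in [0, 1] to an (r, g, b, a) entry of the lookup table."""
    index = min(int(density * len(lookup_table)), len(lookup_table) - 1)
    return lookup_table[index]

def composite_back_to_front(densities, lookup_table):
    """Blend samples along one viewing ray, ordered from farthest to nearest."""
    red = green = blue = 0.0
    for density in densities:
        r, g, b, a = classify(density, lookup_table)
        # "Over" operator: the nearer sample partially hides what is behind it.
        red = a * r + (1.0 - a) * red
        green = a * g + (1.0 - a) * green
        blue = a * b + (1.0 - a) * blue
    return red, green, blue

# Two-entry table: low density is transparent, high density is half-opaque red.
lookup_table = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.5)]
pixel = composite_back_to_front([0.9, 0.1, 0.9], lookup_table)
```

In the texture-based method each textured plane plays the role of one sample per pixel, and the graphics hardware performs this blend for all pixels at once.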
Transition for code development between workstation and multi-pipe rendering architectures can be done with the Multi-pipe Utility (www.devprg.sgi.de.devtools/tools/MPU), a programming interface for OpenGL. It allows development for large-scale environments such as CAVEs, Power-Walls, ImmersaDesks and the like. As an example, it allows a multiprocessor, multi-pipe application to be developed and tested on a desktop workstation with a single graphics board, without recompilation. Another library, developed at Lawrence Livermore National Laboratory, is the Virtual Display Library, VDL [14]. VDL provides a simple API for threaded, multi-pipe rendering through basic display management services, such as window and thread creation, and double-buffer window synchronization.
Programming these graphics supercomputers is thus becoming much easier, and more affordable, since their power can be harnessed and shared by remote users.
Scripting is important for rapid prototyping before compiling an application, for batch mode submission when software rendering is possible, or to be able to repeat the same image creation with a new dataset. The programmer can be confronted with proprietary scripting languages (e.g. AVS5, AVS/Express, EnSight) that are more difficult to comprehend. Other approaches offer an interface via a more widespread language like Tcl (e.g. Vtk, ICEM Visual3). Important in all scripting support is the ability to write loops and control flow structures, and to favor code reuse through callable macros containing common sets of instructions. A difficulty of editing journal files is that they can be very state-dependent. Execution must be carried out in a specific order, often dependent also on the number and names of the variables created. Finally, not all commands available through the User Interface can be scripted. For example, in EnSight, the selection of parts can be done via the common UNIX syntax of wildcard naming. Yet, this is translated into a script command explicitly selecting parts by name, and it cannot be programmed directly via a helper application. When one of the difficulties of handling large data is the large number of computational blocks and their derived graphics representations, using a helper application to automatically write the script is also critical. A program of this kind was created to support visualization of MemCom databases in aerodynamics [3]. The NSMB code development, for example, generates between 100 and 1000 blocks for detailed simulations of aircraft [15]. Surface extraction and other routine data extraction must be completely automated to save time and reduce scripting errors (fig. 1).
Fig 1 - Flow simulation around the F18 of the Swiss Army, by Alain Gehri and Jan Vos, CFS Engineering, Lausanne. An EnSight script automatically queried a MemCom database storing 194 computational zones for 4.4 million nodes. The graphics are delivered via a 100 Mbytes/sec line from a multi-CPU file server to a graphics workstation. Interactive queries are carried over TCP sockets.
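A helper application of the kind discussed in the scripting section can be sketched as follows: it expands a UNIX-style wildcard into explicit part names and writes one group of script lines per part. The command names (`select_part`, `color_by`, `save_image`) and the part names are placeholders invented for the example; they are not EnSight syntax, and this is not the program referred to in [3].

```python
from fnmatch import fnmatchcase

def expand_parts(part_names, pattern):
    """Expand a UNIX-style wildcard into the explicit list of matching parts."""
    return [name for name in part_names if fnmatchcase(name, pattern)]

def write_script(part_names, pattern, variable):
    """Emit one group of script lines per selected part.

    The command names below are placeholders, not real EnSight syntax.
    """
    lines = []
    for name in expand_parts(part_names, pattern):
        lines.append(f"select_part {name}")
        lines.append(f"color_by {variable}")
        lines.append(f"save_image {name}_{variable}.png")
    return "\n".join(lines)

# Hypothetical part names for a multi-block dataset.
part_names = [f"wing_block_{i:03d}" for i in range(1, 4)] + ["fuselage_block_001"]
script = write_script(part_names, "wing_*", "pressure")
print(script)
```

Generating the explicit names programmatically keeps the script valid when the number of blocks changes from one simulation to the next.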
Scientific visualization is no longer constrained to small data handling. From PC-based consumer boards which can process volumetric data in real-time, to large HPC environments including data access, visualization extraction, and graphical representation, all processed in parallel, the environments available today can easily process distributed data and deliver 3D results to remote desktops. A mix of expensive hardware and many advanced software developments can make the visualization of terabytes of data a common feat. Yet, other emerging activities will clearly complement the advanced environments of today. Feature detection for automatic searches through very large Fluid Dynamics databases is becoming available in commercial products for the identification of vortex cores, shock waves, separation lines and surface flow topology [16]. These will replace the very tedious and error-prone interactive queries that can render visualization systems completely ineffective. Data Mining and Knowledge Discovery are also contributing to gaining insight from large protein databases and other huge data repositories. In this context, and for very large data handling, visualization will perhaps turn back to a batch-oriented activity trained to deliver only the quintessential features of large data banks.
©EPFL-SCR # 12 - 2000 |