thanks nico, the url is

@:

http://vxl.net/Products/products.aspx?CategoryID=Nw==-ejXBZi/4MdY=&SubCategoryID=MTE=-DHKo1OyF9dE=&ProductID=NjQ%3d-v%2b1AeSBY%2bCg%3d

the write up i

found when looking up compatibility is as follows....

Monday, November 01

2010

Using Parallel Processing on GPUs to Accelerate Finite-Difference

Time-Domain Algorithms for Electromagnetic and Seismic Apps...

B

Many

scientific, technical and engineering applications in finance, medical

imaging, modeling,

simulation, and image processing can benefit greatly from

the floating

point acceleration offered by modern general purpose graphics

processing units (GPGPU). Todaybs graphics processors have evolved into

sophisticated, massively-parallel, highly-programmable compute engines

ideally suited for algorithms with a high degree of data parallelism.

Combined with modern parallel programming languages and application

interfaces such as industry standard OpenCLb", GPGPU offers a new

paradigm

for high-performance computing (HPC).

B

Figure 1. ATI Radeon HD 5870 GPU

Architecture

In this article we plan to examine the use of GPGPU to accelerate

iterative, grid-based, finite-difference methods used extensively in HPC

applications. Finite-Difference Time-Domain (FDTD) solvers are used to model

electromagnetic wave propagation. Velocity-stress

finite-difference

time-domain (VS-FDTD) solvers are used to model

acoustic wave propagation in

seismic applications. Both algorithms are

ported to the ATI Radeonb" HD

5870 and ATI Radeonb" HD 5970 GPU

co-processors and benchmarked against a

CPU implementation. Letbs start with an overview of the ATI Radeon HD 5870

GPU.

ATI Radeon HD 5870 Graphics Processor

Overview: The ATI Radeon HD 5870

GPU is a high-performance graphics processor

delivering 2.72 teraFLOPS

(peak) single precision floating point

performance, at 14.47 gigaFLOPS/

Watt. The GPU supports the OpenCL 1.0

API and complies with the IEEE

754-2008 floating point arithmetic

standard. Configured with 1 GByte of

GDDR5 memory, the GPU supports 153 GBytes/sec of memory bandwidth to local

memory. Communication with a

host CPU is via the x16 PCI ExpressB. 2.0 bus.

B

Architecture: For GPGPU applications, key functional blocks include the 20

single-instruction, multiple-data (SIMD) engines, thread dispatch

controller, memory controller, instruction / constant / memory caches

and 64

Kbytes of global memory (Figure 1). Each of the SIMD engines

contain an

array of 80 stream processor elements (PEs), a 32 KByte local data store, a

fetch unit with control logic and an 8 KByte L1 cache.

Each SIMD engine is

able to run its own instruction threads.

Communication with other SIMDs is

through the 64 KByte global memory. To maximize computation throughput and

hide memory latency, the

ultra-threaded dispatch processor launches

multiple, simultaneous thread contexts. Memory bandwidth is optimized by four

memory controllers

connected to four 128 KByte L2 caches that coalesce

memory reads and

writes to drive the external 256-bit wide memory bus.

B

B

Each SIMD engine contains 80 stream PEs organized into 16 thread

processors.

Each thread processor consists of 5 PEs, a branch unit and

general purpose

registers. The 5 PEs include 4 32-bit general purpose

floating point

multiply and add units and 1 special purpose unit to

support floating point

multiply, add, and transcendental functions.

Performance: Using the GPGPU

Computational Performance benchmark from SiSoftware

Sandra 2010, the ATI

Radeon HD 5870 GPU scores 1820 Mpixels/s.[a]

SiSoftware Sandra calculates

the Mandelbrot set using OpenCL.B B B

Benchmarking with ComputeMark, the

ATI Radeon HD 5870 GPU scores

2212.[b] ComputeMark is a DirectXB. 11 Shader

Compute benchmark and is

well suited to measure GPGPU performance, as it is

capable of utilizing 99% of GPGPU resources.

B

Processor Performance

Benchmarks

OpenCL: The Open Computing Language, or OpenCL, is the open

standard for

parallel programming of GPGPUs and multi-core CPUs. OpenCL

provides for a portable and a high-performance framework for the development

of cross platform and vendor independent applications. An OpenCL application

is partitioned into two primary components comprised of one or more

computational kernels that are executed on the compute devices, and a

host

program that coordinates the asynchronous execution of the kernels and

manages the movement of data to distributed memory resources.

GPGPU

Implementation of Finite-Difference Time-Domain Solvers

Finite-Difference

Time-Domain Methods: Finite-Difference Time-Domain (FDTD) methods are used to

iteratively

solve time-dependent partial differential equations by

discretizing the equations and representing the solution on a spatial grid.

The solution is evolved in time using an iterative time-stepping algorithm. An

important feature of FDTD algorithms is the use of a local stencil so

that

the solution at any point in the grid is dependent upon the values of

neighboring grid points. FDTD methods are used in many fields of

scientific

modeling and simulation and form the basic computational

kernel upon which

many software packages are built.

Electromagnetics: The FDTD method is used in

computational electromagnetics where the

time-dependent relationship between

electric and magnetic field

components is governed by Maxwellbs equations,

B

B

Using the FDTD method [1], the electric (E) and magnetic (H) field

vectors

are represented on staggered-grids and the solution is generated by

updating E and H using a leap-frog algorithm that approximates the

time-dependent solution through small discrete time-steps. The result

is a

time-varying solution that describes the propagation of

electromagnetic

waves. The FDTD method for electromagnetics is used for modeling and

simulation in military and commercial applications

ranging from radar to

radio communications.

Seismic: The FDTD method is also used to model seismic

wave propagation governed by the elastic wave equation,

B

The time-dependent

solution describes the propagation of acoustic waves in an elastic medium.

The solution for the particle velocity vector (N=)

and stress tensor (O) is

represented on staggered-grids similar to

those used in computational

electro-magnetics. The Velocity-Stress FDTD (VS-FDTD) method [2] is an

important algorithm used in seismic forward modeling with applications to oil

and gas exploration and also

military applications for detecting buried

structures.

B

GPU Implementation: The basic FDTD algorithm is implemented

using OpenCL targeting the ATI Radeon HD 5870 and ATI Radeon HD 5970 GPU

co-processors. The E and H

vector fields are represented on

three-dimensional spatial grids and

iteratively updated using local stencil

operations that approximate the derivatives in the governing equations. The

FDTD algorithm exhibits a

low ratio of computation to memory access making

data flow the primary

consideration. The E and H vector fields are stored in

separate arrays, with the vector components interlaced in stripes along the

fastest

index. The grids are re-ordered along the fastest index to optimize

the use of SIMD operations on the GPU by aligning values used in non-local

stencil operations. The updates for E and H are performed using

separate

OpenCL kernels executed in succession to approximate the

time-evolution of

the quantities over discrete time-steps.

The

GPU implementation of the

VS-FDTD algorithm for seismic wave

propagation has a similar structure, but

differs in two ways. First,

the amount of data represented per grid point is

greater since the

model is comprised of a vector and symmetric tensor field

(3+6

components) as compared with the two vector fields (3+3 components)

used for electromagnetics. Furthermore, the seismic application uses

more

spatially dependent physical parameters. Second, the complexity of the

computation of the updates is slightly more complex for VS-FDTD.

In order to

utilize the dual GPUs on the ATI Radeon HD 5970, steps are

introduced to

exchange the edges of the sub-grids over which the

simulation is

distributed. The algorithm performs two iterative updates of each quantity

and then exchanges edges sufficient to join the

sub-grids. This technique

can be extended to larger simulations

utilizing additional GPUs per node or

multiple nodes, where the latter

case would require the use of MPI for

inter-node communication.

Figure 2 illustrates the resulting electromagnetic

(top) and seismic (bottom) simulations obtained by executing the FDTD and

VS-FDTD algorithms on

the ATI Radeon HD 5870 and ATI Radeon HD 5970 GPUs

respectively.

B

B

Performance: Benchmarks were performed for both FDTD and

VS-FDTD algorithms using

an ATI Radeon HD 5870 GPU and ATI Radeon HD 5970

GPU, along with a

standard C implementation targeting a CPU as a reference

(Table).[c]

The multi-core performance was extrapolated from results for a

single

core implementation on a hexacore processor, which places an upper

bound on performance that is unlikely to be achieved with six (6) cores

competing for resources. Performance is reported in terms of the time

per

iterative update normalized to a million-point grid.

B

Conclusions

B

The

results demonstrate that these algorithms can be accelerated using an

OpenCL

implementation targeting GPUs. Relative performance between the

GPU and CPU

is very similar for both algorithms, with the seismic

VS-FDTD showing

slightly better acceleration due to the increased

computational complexity.

For the electromagnetic FDTD and seismic

VS-FDTD algorithms, the

acceleration using a single-GPU ATI Radeon HD

5870 is 2.9x and 3.2x,

respectively. The acceleration factors are

increased to 3.7x and 4.2x,

respectively, for the dual-GPU ATI Radeon

HD 5970, where the additional

overhead of swapping edges between

sub-grids and the reduced clocks prevents

perfect scaling.

This article was written by David Richie, Ph.D., President,

Brown Deer

Technology (Forest Hill, MD) and Peter Mandl, Senior Product

Marketing

Manager, AMD (Sunnyvale, CA). For more information, contact Mr.

Richie

at

[hidden email] e-mail address is being

protected from spambots. You need JavaScript enabled to view it , Mr.

Mandl

at

[hidden email] e-mail address is being protected

from spambots.

You need JavaScript enabled to view it , or visit

http://info.hotims.com/28059-402.

B

Footnotes

B

[a] System configuration:

AMD Phenomb" II X4 940 processor-based system, 3 GHz, ASUSTek M3A79-T

DELUXE, 4BG DDR2-166, WindowsB. 7 64-bit Enterprise OS, ATI RadeonoL HD 5870

with 1GB memory, GPU clock 850 MHz, memory

clock 1.2 GHz, Driver 8.680.0.0,

OpenCL base build, SDK 2.0 Beta 4.

[b] System configuration: AMD Phenom II X4

940 3 GHz (Deneb), ASUSTeK

M3A79-T DELUXE, 4GB Kingston DDR2-1066 MHz,

Windows 7 64-bit ATI Radeonb" HD 5870 with 1GB memory, GPU clock 850 MHz,

memory clock 1.2 GHz, ATI Driver: 8.73 base kit.

[c] System configuration (i)

ATI Radeon HD 5870 single GPU, Memory: 1GB GDDR5, 850 MHz core clock, 1200 MHz

memory clock, Catalyst 10.2, ATI Stream SDK V2.1, BDT coprthr_sdk:

v1.0-RC1

(ii) Visiontek Radeon HD 5970 dual-GPU, Memory: 2GB GDDR5, 725 MHz core

clock, 1000 MHz memory clock, Catalyst 10.2, ATI Stream SDK

V2.1, BDT

coprthr_sdk: v1.0-RC1 (iii) ASUS M4A79T motherboard, CPU: AMD Phenom II X6

1090T BE @ 3.2 GHz, Memory: 4C4GB DDR3-1600 (9-9-9-24

timing), OS: Linux

CentOS 5.4 x86_64.

References

B

[1] A. Taflove and M.E. Brodwin, IEEE Trans.

Microwave Theory Techn., Vol. 23, pp. 623-630, (1975).

[2] S.E. Minkoff, J.

Sci. Comput. Vol. 24, No. 1, pp 1-19 (2002).B

B

B

B

B

http://rns-thoughts.blogspot.com/tel : +61402 350 315

Rajneesh N. Shetty

>________________________________

>From: Nico Kadel-Garcia <

[hidden email]>

>To: Rajneesh N. Shetty <

[hidden email]>

>Cc:

"

[hidden email]" <

[hidden email]>; Misc list

<

[hidden email]>

>Sent: Sunday, 17 July 2011 5:18 PM

>Subject: Re: CVS:

cvs.openbsd.org: ports

>

>On Sun, Jul 17, 2011 at 2:55 AM, Rajneesh N. Shetty

><

[hidden email]> wrote:

>> would obsd 4.9 B work ok on the

attached specifications?

>>

>> please advise if

>> anyone has tried it so far.

this one is a "notebook".

>>

>> they have an athlon

>> version as well which

is a "netbook", but i'am not too sure i want to try that

>> one for bsd yet...

>

>[ text snipped ]

>

>>

>> [demime 1.01d removed an attachment of type

application/pdf which had a name of TL120 Series_VXL.pdf]

>

>Looks like your

PDF got snipped. Perhaps you can publish a URL instead?