(1)

Parallelization of the Nanoscale Device Simulator nanoMOS 2.0 Using a 100 Nodes Linux Cluster

IEEE Nanotechnology Conference Aug 26-28, 2002

Washington DC

Sebastien Goasguen, Ali R. Butt, Kevin D. Colby, Mark Lundstrom

Purdue University

Electrical Engineering Department

(2)

Outline

Problem overview

nanoMOS 2.0

The cluster and PUNCH

Parallelization

Parallelization under Matlab

MPITB and PVMTB

PVMTB basics

Parallelization of Ballistic NEGF in nanoMOS 2.0

Parallelization of a detailed scattering model

(3)

nanoMOS 2.0

[Figure: device schematic showing source, drain, gate, and silicon dioxide (SiO₂) layers, with L = 10 nm and an energy axis.]

• 2D simulation of nanoscale silicon SOI MOSFETs
  -NEGF
  -BTE (ballistic)
  -drift-diffusion
  -energy transport
  -density gradient
  -effective potential

• Written in Matlab

• Parallelized for a Linux cluster

• More than 180 downloads via PUNCH

(4)

www.nanohub.purdue.edu

[Figure: PUNCH as the middleware layer between software applications and workstations, providing web enabling.]

PUNCH middleware:

-network operating system
-logical user accounts
-virtual file system
-resource management system

(5)

The cluster

• 200 processors

• 130 GFLOPS

• 2 x 1.2 GHz Athlon CPUs and 1 GB RAM per node

(6)

Integrating a cluster with PUNCH

The user does not know that the job is being run on a cluster.

Install a compute Backend on the cluster

Jobs are scheduled to the cluster Backend through the ACTyP.

The Backend submits the jobs to the PBS (Portable Batch System) of the cluster.

PUNCH can also interface with Condor (job migration).

Backends could be installed on remote clusters and PUNCH jobs could be submitted through the PBS queue.

(7)

PUNCH runs jobs on the cluster!

• PBS monitors jobs on the cluster.

• Punch000 runs a job through PBS.

(8)

Parallelization

Code parallelization is done through message passing between processors.

Message passing is implemented with MPI (Message Passing Interface) or PVM (Parallel Virtual Machine).

Break the problem into smaller subproblems. MPI or PVM links the subproblems and combines their results into the full solution.

PVM home page (links and tutorial)

http://www.csm.ornl.gov/pvm

PVM home page at netlib (source code)

http://www.netlib.org/pvm3/

MPI home page

http://www.mpi-forum.org

(10)

Parallel programming under Matlab

Matlab is not parallel, but tools are available for parallel computation on shared-memory or distributed-memory machines (Matlab Central File Exchange).

Javier Fernandez Baldomero of the University of Granada, Spain, wrote PVMTB and MPITB.

Both are C/C++ MEX-function bindings to the MPI and PVM routines.

They are a good fit for PC clusters: easy to use, and a very good tool for learning parallelization in the Matlab environment.

Production codes will be written with MPI or PVM!

(11)

Sending and receiving data with PVM

Buffer initialization needs to be done before packing anything

pvm_initsend(PLACE)

Packing can be done in various ways, but Matlab matrices are packed with

pvm_pack(A)

Sending can be directed at one particular Matlab child or at all of them.

pvm_send(tid,TAG) or pvm_mcast(tids,TAG)

(12)

Sending and receiving data

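Parent

    A=eye(4);
    pvm_initsend(PLACE)          % initialize the send buffer before packing
    pvm_pack(A)                  % pack the matrix
    pvm_mcast(tids,TAG)          % send to all children
    for numt=1:(nhost-1)         % collect one reply per child
        pvm_recv(tids(numt),TAG)
        pvm_unpack('A_new')
        A_new
    end

Children

    parent=pvm_parent;
    tid=pvm_mytid;
    pvm_recv(parent,TAG)         % wait for the parent's message
    pvm_unpack('A')
    A=A+tid; %dummy operation
    pvm_initsend(PLACE)
    pvm_pack(A)
    pvm_send(parent,TAG)         % send the result back to the parent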
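The parent/children exchange above can be mimicked without a PVM installation. The following is a minimal Python sketch, not PVMTB code: threads stand in for Matlab children, `queue.Queue` objects stand in for PVM message buffers, and the small integer ids are hypothetical placeholders for real PVM task ids. It only illustrates the order of operations (multicast, dummy update, reply, collect).

```python
import queue
import threading


def child(tid, inbox, parent_inbox):
    """Receive a matrix, add this child's id to every entry, send it back."""
    a = inbox.get()                                # analogue of pvm_recv + pvm_unpack
    a_new = [[x + tid for x in row] for row in a]  # the dummy operation A = A + tid
    parent_inbox.put((tid, a_new))                 # analogue of pvm_pack + pvm_send


def parent(tids):
    """Send a 4x4 identity matrix to every child and collect one reply each."""
    a = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    inboxes = {tid: queue.Queue() for tid in tids}
    parent_inbox = queue.Queue()
    workers = [
        threading.Thread(target=child, args=(tid, inboxes[tid], parent_inbox))
        for tid in tids
    ]
    for w in workers:
        w.start()
    for tid in tids:
        inboxes[tid].put(a)                        # analogue of pvm_mcast
    replies = dict(parent_inbox.get() for _ in tids)
    for w in workers:
        w.join()
    return replies
```

Calling `parent([1, 2, 3])` returns a dictionary keyed by child id, where each value is the identity matrix with that id added to every entry.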

(14)

NEGF Ballistic and with Scattering

DOS is obtained by integrating the spectral function $A$ over energy:

$$\iiint A(E_x, E_y, E_z, \mathbf{r})\, dE_x\, dE_y\, dE_z$$

Mode space approach takes care of the confinement direction:

$$\iint A(E_x, E_y, \mathbf{r})\, dE_x\, dE_y$$

In the ballistic case, the contribution from the $E_y$'s is computed analytically:

$$\int A(E_x, \mathbf{r})\, dE_x$$

With scattering: ?

(15)

Schematic view of Parallel nanoMOS

Master: pre-processing, Poisson, 1-D Schrödinger in the confinement direction

Communication: PVM routines

Slaves: compute their respective part of the DOS and send back the carrier density
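The master/slave split above can be sketched in a few lines. This is a hedged Python illustration rather than the nanoMOS implementation: the Lorentzian `toy_spectral_function`, its parameters, and the contiguous partitioning scheme are assumptions chosen only to show how an energy grid can be divided among workers and the partial integrals summed by the master. The workers run serially here; in the real code each slice would be sent to a slave through PVM.

```python
import math


def partition(n_points, n_workers):
    """Split indices 0..n_points-1 into n_workers contiguous, near-equal ranges."""
    base, extra = divmod(n_points, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def toy_spectral_function(energy, level=0.1, broadening=0.01):
    """Hypothetical Lorentzian stand-in for the spectral function."""
    return (broadening / math.pi) / ((energy - level) ** 2 + broadening ** 2)


def worker_partial_sum(energies, e_step, start, stop):
    """What one slave would compute: its slice of the energy integral."""
    return sum(toy_spectral_function(e) for e in energies[start:stop]) * e_step


def master(e_min, e_max, e_step, n_workers):
    """Build the energy grid, hand out slices, and add up the partial results."""
    n_points = int(round((e_max - e_min) / e_step)) + 1
    energies = [e_min + i * e_step for i in range(n_points)]
    parts = [
        worker_partial_sum(energies, e_step, a, b)
        for a, b in partition(n_points, n_workers)
    ]
    return sum(parts)
```

Because each energy point is independent, the result from `master(0.0, 0.5, 0.00025, 8)` agrees with the single-worker result up to floating-point rounding, which is what makes this step a natural candidate for message-passing parallelization.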

(16)

Computational Cost on a single processor

[Figure: CPU time measurements, Time (s) on a 0 to 600 scale, for four cases: Estep=0.125 meV; Estep=0.25 meV; Estep=0.5 meV; Estep=0.25 meV, all valleys.]

(17)

Parallelization results (Estep=0.25 meV)

(18)

Bottleneck = Computation/Communication
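One way to see why the computation-to-communication ratio matters is a toy timing model. The sketch below is an assumption for illustration, not a model taken from the slides or fitted to the measurements: it supposes that compute time divides evenly across workers while messaging cost grows linearly with the number of workers.

```python
def run_time(t_compute, t_comm_per_worker, n_workers):
    """Toy model: compute time divides across workers, messaging cost grows with them."""
    return t_compute / n_workers + t_comm_per_worker * n_workers


def efficiency(t_compute, t_comm_per_worker, n_workers):
    """Speed-up relative to a serial run with no messaging, divided by worker count."""
    return t_compute / run_time(t_compute, t_comm_per_worker, n_workers) / n_workers
```

Under this model, `efficiency(100.0, 1.0, 10)` is 0.5, while raising the compute time tenfold at the same messaging cost, `efficiency(1000.0, 1.0, 10)`, gives about 0.91: the larger the ratio of computation to communication, the closer the run gets to ideal scaling.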

(19)

Serial vs. Parallel

[Figure: serial vs. parallel run time, Time (s) on a 0 to 600 scale, for one valley and for all valleys.]

(20)

PVM versus MPI

Dashed line: PVM

Solid line: MPI

Blue: DOS is full matrix

Red: DOS is sparse

Green: no DOS

(21)

PVM versus MPI: Efficiency of non-self-consistent loops

Blue: DOS is full matrix

Red: DOS is sparse

Green: no DOS

Dashed line: PVM

Solid line: MPI

(23)

NEGF Ballistic and with Scattering

DOS is obtained by integrating the spectral function $A$ over energy:

$$\iiint A(E_x, E_y, E_z, \mathbf{r})\, dE_x\, dE_y\, dE_z$$

Mode space approach takes care of the confinement direction:

$$\iint A(E_x, E_y, \mathbf{r})\, dE_x\, dE_y$$

In the ballistic case, the contribution from the $E_y$'s is computed analytically:

$$\int A(E_x, \mathbf{r})\, dE_x$$

In the detailed scattering model, numerical integration over both longitudinal and transverse energies is required ($\iint$).
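The jump in cost from the ballistic case to the detailed scattering model can be estimated by counting energy grid points. The sketch below is a rough illustration under stated assumptions: the 1000 meV energy window is hypothetical, and it assumes the same step on both energy axes and one spectral-function evaluation per grid point.

```python
def energy_points(range_mev, step_mev):
    """Number of grid points covering range_mev with spacing step_mev."""
    return int(round(range_mev / step_mev)) + 1


def evaluations(range_mev, step_mev, energy_dimensions):
    """Spectral-function evaluations for a tensor-product energy grid."""
    return energy_points(range_mev, step_mev) ** energy_dimensions
```

With Estep = 0.25 meV over the assumed window, `evaluations(1000, 0.25, 1)` gives 4001 points for a single energy integral, whereas `evaluations(1000, 0.25, 2)` gives 16008001 for a double integral. That growth is why the scattering model benefits so strongly from a cluster.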

(24)

Results

A non-self-consistent simulation used to take ~3-4 days (76 hours) on a single-processor machine.

It now takes 2-3 hours on 40 processors.

Self-consistent simulations, which would have taken ~40 days, are therefore now possible in less than 36 hours.

CPUs    Run time       Speed-up
2       38.65 hours    1.96x
38      2.10 hours     36.2x
80      1.07 hours     71x
120     45 minutes
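The quoted speed-ups follow from dividing the 76-hour single-processor time by each parallel run time. The short Python sketch below reproduces that arithmetic and also computes parallel efficiency (speed-up divided by CPU count), which the slides do not report. It assumes that the 76-hour figure is the baseline for every row, including the 120-CPU run, for which no speed-up is quoted.

```python
SERIAL_HOURS = 76.0  # single-processor time quoted on the slide

# (CPUs, run time in hours) as quoted on the slide; 45 minutes = 0.75 hours
RUNS = [(2, 38.65), (38, 2.10), (80, 1.07), (120, 0.75)]


def speed_up(serial_hours, parallel_hours):
    """Ratio of serial run time to parallel run time."""
    return serial_hours / parallel_hours


def efficiency(serial_hours, parallel_hours, n_cpus):
    """Speed-up per CPU; 1.0 would be ideal linear scaling."""
    return speed_up(serial_hours, parallel_hours) / n_cpus


def summarize(runs=RUNS, serial_hours=SERIAL_HOURS):
    """Return (CPUs, speed-up, efficiency) for each quoted run."""
    return [
        (n, speed_up(serial_hours, t), efficiency(serial_hours, t, n))
        for n, t in runs
    ]
```

Running `summarize()` gives speed-ups of roughly 1.97, 36.2, and 71.0 for 2, 38, and 80 CPUs, matching the quoted values to within rounding. Under the same baseline assumption, the 120-CPU run works out to about 101x, a derived figure rather than one stated on the slide.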

(25)

Conclusions

Gained expertise in parallelization techniques (PVM and MPI)

Set up the PVMTB and MPITB toolkits for parallelization under Matlab

Tutorial for PVMTB is available

Wrote a parallel version of nanoMOS 2.0

Benchmarked performance and compared MPI with PVM

Wrote a parallel detailed scattering model that gave us good physical insight into scattering phenomena
