LibCudaOptimize  1.0
LibCudaOptimize Documentation
Youssef S.G. Nashed
Roberto Ugolotti


LibCudaOptimize is a GPU-based open source library that allows you to run state-of-the-art bio-inspired optimization heuristics in parallel to optimize a fitness function, introduce a new optimization algorithm, or easily modify/extend existing ones. In the first case, the only thing you need to do is to write your fitness to be optimized in C++ or CUDA-C, while in the second and third cases, you can take advantage of the framework offered by the library to avoid the need to go deep into basic implementation issues, especially regarding parallel code.

How To Install It

In order to install libCudaOptimize, the following steps should be followed:

  1. Download the packages from the project webpage on SourceForge: libCudaOptimize, the library proper, and testLibCudaOptimize, a small program which tests whether the installation has been performed correctly.
  2. Build the library from source code: the easiest way to do so is to use CMake. To build testLibCudaOptimize, one has to link against the previously installed libCudaOptimize static library and the CUDA utilities library.
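The linking step above can be sketched with a minimal CMakeLists.txt. This is only an illustration: the target name, include/library paths, and the library name are assumptions to be adapted to your installation, not files shipped with the library.

```cmake
# Hypothetical CMakeLists.txt for a program linking against libCudaOptimize
# (all paths and names below are placeholders)
cmake_minimum_required(VERSION 2.8)
project(myOptimizationApp)

find_package(CUDA REQUIRED)

include_directories(/path/to/libCudaOptimize/include)
link_directories(/path/to/libCudaOptimize/lib)

cuda_add_executable(myOptimizationApp main.cpp)
# link against the static library and the CUDA libraries
target_link_libraries(myOptimizationApp LibCudaOptimize ${CUDA_LIBRARIES})
```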

LibCudaOptimize has been tested on Windows 7 and Ubuntu Linux (from 10.04 to 12.04), using GPUs with compute capability (CC) from 1.3 to 3.0. It works under CUDA 4.X and 5.0. Please note that GPUs with lower CC impose some limitations on the use of LibCudaOptimize due to hardware limits.

How To Use It

In this section, we will show two different scenarios in which the library can easily be applied. In the first one, one of the methods already implemented in the library is used to optimize a user-defined function. In the second one, a novel algorithm is implemented, starting from the ones already available.

Use Case 1

The signature of your fitness function must match the one described in EvalFuncPtr. You should read the candidate solutions from the SolutionSet and write the results of your function into the FitnessSet. Even though this function must be host-side, the actual computation can reside on either the device or the host; only a few things need to change between the two cases.

This is a case in which the fitness computation takes place on the host. If this is your case, remember to call setHostFitnessEvaluation(true) on your Optimizer.

void myFitness(const SolutionSet* setPtr, FitnessSet* fitnPtr, dim3 calculateFitnessGrid, dim3 calculateFitnessBlock)
{
  //create a temporary array that will contain the fitness values
  float* fitnesses = new float[setPtr->getSolutionNumber()];
  //evaluate every solution of set 0 (the first argument is the index of the set)
  for (unsigned int i = 0; i < setPtr->getSolutionNumber(); i++)
    fitnesses[i] = myHostFunction(setPtr->getHostPositionsConst(0, i));
  //write the fitness values into fitnPtr
  delete [] fitnesses;
}

If the computation happens on the device (for instance, in a CUDA kernel):

void myFitness(const SolutionSet* setPtr, FitnessSet* fitnPtr, dim3 calculateFitnessGrid, dim3 calculateFitnessBlock)
{
  myDeviceFunction(setPtr->getDevicePositionsConst(), fitnPtr->get(), setPtr->getActualSolutionSize());
}

void myDeviceFunction(const float* pos, float* fitn, int dim)
{
  //in the actual kernel, i is derived from the thread/block indices
  fitn[i] = doSomethingOn(pos[i*dim]);
}

In both cases, you have to create your optimizer, in this case Particle Swarm Optimization:

int main() {
  PSO_Optimizer p(&myFitness, 16, 1, 64);
  float* myResults = p.getBestSolution();
  return 0;
}

For more information, see testLibCudaOptimize on the SourceForge page.

Use Case 2

In this case, we will show how to start creating a new algorithm (a parallel version of Simulated Annealing), starting from the ones already implemented in the library. This snippet should be used as a guideline for the advanced user who wants to extend the library with new optimization techniques.

First of all, extend IOptimizer, implementing the required methods:

#include "IOptimizer.h"

class SA_Optimizer : public virtual IOptimizer
{
  SolutionSet m_neighbors;
  float m_temperature;
public:
  virtual ~SA_Optimizer();
  virtual bool init();
  virtual void initSolutions(dim3, dim3);
  virtual void step(dim3, dim3);
  virtual void fitnessEvaluation(dim3, dim3, bool first = false);
  virtual void update(dim3, dim3, bool first = false);
};

The SA_Optimizer class has only two additional member variables: a SolutionSet instance representing the neighbors of the current state in an iteration, and a float variable holding the temperature of the system. The method initSolutions can be implemented using the initialization functions of the DE_Optimizer and PSO_Optimizer classes, in which the positions of the population of solutions are set randomly over the search space. The only functions that have to be implemented from scratch are step and update, by specifying a neighborhood generating procedure and the acceptance probability function, respectively. Each of these functions calls a CUDA-C kernel to execute the appropriate method in parallel, specifying the thread block configuration required: every thread block performs the calculations for a solution in a set, while every thread in the block represents a dimension of the problem under optimization. Code snippets for the host and device functions of step, which creates new candidate solutions that will then be evaluated by fitnessEvaluation, are presented below:

void SA_Optimizer::step(dim3 updateGrid, dim3 updateBlock)
{
  h_cudaSA_step(m_solutionSet.getDevicePositions(), m_neighbors.getDevicePositions(),
                m_solutionSet.getSolutionNumber(), m_solutionSet.getProblemDimension(),
                m_solutionSet.getActualSolutionSize(), m_dBounds, m_randStates,
                updateGrid, updateBlock);
  //simple geometric cooling schedule
  m_temperature *= 0.98f;
}

h_cudaSA_step is a host function whose only purpose is to launch the kernel sa_step, since a CUDA kernel cannot be called directly from a C++ class.

__global__ void sa_step(const uint32_t solutionNumber, const uint32_t problemDimension, const uint32_t actualSolutionSize, float* positions, float* potentialPositions, float2* bounds, curandState* devStates)
{
  const uint32_t tid = threadIdx.x;
  const uint32_t solutionID = IMUL(blockIdx.x, gridDim.y) + blockIdx.y;
  const uint32_t posID = IMUL(solutionID, actualSolutionSize) + tid;
  //load the random number generator state into a thread-local variable
  curandState localState = devStates[posID];
  if (tid < problemDimension)
  {
    //sample a neighbor position within [-0.1, 0.1] of the current one
    float y = positions[posID] + (-0.1f + (curand_uniform(&localState) * 0.2f));
    cropPosition(bounds, y, blockIdx.x, tid, actualSolutionSize);
    potentialPositions[posID] = y;
  }
  //store the generator state back to global memory
  devStates[posID] = localState;
}

These algorithms usually test candidate positions sampled from a fixed or adaptive neighborhood around the current position. In our case, we generate candidate solutions by uniform sampling from a hypercube with a side length of 0.2, centered at the current solution. The cuRAND library supplied with the CUDA SDK is used to generate random numbers: for every dimension of each solution, a random number generator state (curandState) is stored in a global memory array (devStates). First, the generator state is loaded into a thread-local variable for efficiency. Then, a new position y is computed for every dimension by adding a uniform random number in the range [-0.1, 0.1] to the current position. Finally, the neighbor position, along with the generator state, is stored back in global memory at the end of the kernel.


If you use LibCudaOptimize in your paper, please refer to it with:

Y.S.G. Nashed, R. Ugolotti, P. Mesejo, S. Cagnoni (2012)
libCudaOptimize: an open source library of GPU-based metaheuristics
Proceedings of the 14th Genetic and Evolutionary Computation Conference (companion)

Please also let us know how you use it, possibly giving some feedback.

LibCudaOptimize has been used in the following papers:

Y.S.G. Nashed, P. Mesejo, R. Ugolotti, J. Dubois-Lacoste, S. Cagnoni (2012)
A comparative study of three GPU-based metaheuristics
Proceedings of the 12th International Conference on Parallel Problem Solving From Nature

For more information regarding our work, please visit the IbisLab website.
