Challenge Two: N-Queens

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

Not sure whether the challenge is still open, but I've kept fiddling. This version posts a nice little speed-up:

Code:

#define NDEBUG 1

#include <algorithm>
#include <bitset>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <vector>

#include <math.h>

using std::bitset;
using std::clock;
using std::cout;
using std::endl;
using std::ostream;
using std::setw;
using std::transform;
using std::vector;

/* This template contains a generic implementation of a N-queens algorithm.
 * The compiler should be able to optimize for speed (at the expense of code
 * size, which is far less important) based on the fact that N is a constant.
 * For efficiency, we can plug in hand-optimized implementations for any
 * value of N.
 *
 * The algorithm assumes that an unsigned long has at least N bits.  Given
 * that it has at least 32 bits, and any computer capable of solving the
 * problem for N > 32 in any feasible amount of time will have at least a 64-
 * bit architecture, this seems safe.
 */

typedef unsigned long column_t;
// A column_t is a bitfield with at most a single bit set, in the bit
// position corresponding to the specified column.

// This functor performs a bitwise AND on its two predicates.
template <unsigned int N>
  struct bitwise_or
  {
    public:

    inline bitset<N> operator() ( const bitset<N> a, const bitset<N> b );
  };

template <unsigned int N>
  inline bitset<N> bitwise_or<N>::operator() ( const bitset<N> a,
                                               const bitset<N> b
                                             )
  {
    return bitset<N> ( a | b );
  }

template <unsigned int N>
  class board
  {
/* This object represents a N * N chessboard.  Internally, it maintains a
 * vector of bitfields, each of size N.  A modern compiler should be able to
 * vectorize operations on these data.  On most implementations, this should
 * fit each bitfield into the smallest appropriate integer type, permitting
 * a SIMD instruction such as POR.  This is probably not the absolutely most
 * compact representation (e.g., the 5x5 board will probably need 5 bytes
 * instead of 4), but it should generate appropriately-aligned code for
 * speed.
 *
 * A 1 in the position representing any square means that a queen definitely
 * threatens that square.  A 0 means that we do not know of any queen which
 * threatens that square.
 *
 * The bit positions are stored in row-major order, with the square on the
 * upper left represented by squares[0][0].
 */
    public:
// We should not need to override the implicit member functions.

    void clear (void);
// Wipes all queens off the board;
    void place_queen( unsigned row, column_t column_bit );
// Place a queen, updating the board as to which squares it attacks.
    column_t first_safe_space( unsigned row ) const;
// Returns the first column in the given row not under attack, or
// (1 << N) if no such column exists.
    column_t next_safe_space ( unsigned row,
                               column_t column_bit
                             ) const;
/* Returns either the first column to the right of the specified column
 * that is not under attack, or (1 << N) if no such column exists.
 */
    static void initialize_attack_matrix(void);
    static void destroy_attack_matrix(void);

    private:
    static vector< vector< bitset<N> > > attack_matrix;

    bitset<N> squares[N];
// Stores N * N bits representing the positions under attack on a N * N
// chessboard.
  };

template <unsigned int N>
  vector< vector< bitset<N> > > board<N>::attack_matrix;
/* Stores bitfields representing positions attacked by a queen in a given
 * square.  The place_queen() member function fills the individual matrix
 * entries in as needed.
 *
 * A queen in row i, column j corresponds to the vector in
 * attack_matrix[ i*N + j ].  There are no entries for the last row, N-1,
 * because they should never be needed.  Only the bitfields representing rows
 * i+1 through N-1 are stored.
 */

template <unsigned int N>
  void board<N>::clear( void )
// Wipes all queens off the board.
{
  for ( size_t i = 0; i < N; ++i )
    squares[i].reset();
}

template <unsigned int N>
  void board<N>::place_queen( unsigned row, column_t column_bit )
// Place a queen on the board, updating all rows below row to show which
// squares are under attack.
  {
/* One annoyance of representing column by a bitfield is that, to convert it
 * back to an index, we use floating-point math.  This is still a constant-
 * time operation on modern computers.  Furthermore, finding the next column
 * to search is also a constant-time operation.  Therefore, this method
 * beats linear search for large N.
 *
 * We aren't using large N, so whether this is really faster is doubtful.
 */
    const unsigned column =
static_cast<unsigned> ( ilogb( static_cast<double> (column_bit) ) );
    const ptrdiff_t i = N * row + column;
    static const bitwise_or<N> or_functor = bitwise_or<N>();

// If the test run passes all these sanity checks, it should be safe to
// recompile with NDEBUG, and run without them.
    assert( row >= 0 );
    assert( row < N - 1 );
    assert( column >= 0 );
    assert( column < N );
#ifndef NDEBUG
    {
      const bitset<sizeof(unsigned long) * CHAR_BIT> temp(column_bit);
      assert ( 1 == temp.count() );
    }
#endif

    const ptrdiff_t squares_to_fill = N - 1 - row;
    vector< bitset<N> > attacks;

    attacks.reserve( static_cast<size_t> (squares_to_fill) );

// If the requested square is already in the database, don't enter it again.
    if ( attack_matrix[i].empty() )
    {
// Find the highest square below us that is in the matrix, then fill upwards
// from there.
      ptrdiff_t j;
      unsigned long shift = 0;

      for ( j = i + N; j < N * N - N; j += N )
        if ( ! attack_matrix[j].empty() )
          {
            attacks = attack_matrix[j];
            shift = attacks.size();

            break;
          }

// If we fell through the loop, we are on the final row.  If we didn't, we
// are on the last filled square.  In either case, we want to move one square
// up.

      j -= N;

      {
/* Starting with the second-to-last row, fill in the attack matrix for the
 * square in that row directly beneath the one we want.
 */
        for ( ; j >= i; j -= N )
        {
          ++shift;

          const bitset<N> bottom_row = column_bit |
                                        column_bit << shift |
                                        column_bit >> shift;

          attacks.push_back(bottom_row);
          attack_matrix[j] = attacks;
        } // end for
      } // end if

      assert ( attacks.size() == (size_t) squares_to_fill );
    } // end if ( attack_matrix[i].empty() )
    else
      attacks = attack_matrix[i];

// Now that we have the attack matrix, AND it with the board.

   transform ( squares + 1 + row,
               squares + N,
               attacks.begin(),
               squares + 1 + row,
               or_functor
             );
  }

template <unsigned int N>
  column_t board<N>::first_safe_space( unsigned row ) const
  {
    const unsigned long to_return = squares[row].to_ulong();

    return static_cast<column_t> ( ~to_return & (to_return + 1) );
  }

template <unsigned int N>
  column_t board<N>::next_safe_space( unsigned row, column_t column_bit ) const
  {
    unsigned long to_return = squares[row].to_ulong();

#ifndef NDEBUG
    {
      bitset<sizeof(column_t) * CHAR_BIT> temp(column_bit);
      assert ( 1 == temp.count() );
    }
#endif

// Mask the columns up to column_bit.
    to_return |= column_bit;
    to_return |= column_bit - 1;

    return static_cast<column_t> ( ~to_return & (to_return + 1) );
  }

template <unsigned int N>
  void board<N>::initialize_attack_matrix (void)
  {
    attack_matrix.resize( (N - 1) * N );
  }

template <unsigned int N>
  void board<N>::destroy_attack_matrix (void)
  {
    attack_matrix.clear();
  }

#ifndef NDEBUG
template <unsigned int N>
  void print_board ( const board<N> &x )
{
  for ( size_t i = 0; i < N; ++i )
  {
    column_t t = x.first_safe_space(i);

    for ( column_t j = 1; j < 1 << N; j <<=1 )
    {
      if ( t == j )
      {
        cout << 'O';
        t = x.next_safe_space(i, j);
      }
      else
        cout << 'X';
    }
    cout << endl;
  }
  cout << endl;

  return ;
}
#endif

template <unsigned int N>
  long solve_queens(void)
  {
    long solutions = 0;
    column_t j[N-1];
    board<N> backtrack[N-1];

    board<N>::initialize_attack_matrix();

/* Find the solutions where the queen on the first row is on the left half
 * of the board.  For each such solution, there is a one-to-one corresp-
 * ondence with the solutions where that same queen is on the right half
 * of the board, and the solutions are reflections of each other.
 */
    for ( j[0] = 1; j[0] < 1 << (N/2); j[0] <<= 1 )
    {
      backtrack[0].clear();
      backtrack[0].place_queen( 0, j[0] );

      size_t current_row = 1;
      j[1] = backtrack[0].first_safe_space(1);

      while (true)
      {
        if ( 1 << N <= j[current_row] )
        {
// There are no (more) possible matches on this row.
          if ( 1 == current_row )
            break;

          --current_row;

          j[current_row] =
backtrack[current_row-1].next_safe_space( current_row, j[current_row] );
        }
        else if ( N - 2 == current_row )
        {
// Descend to the final level.
//
// Any open spaces we find here (there should be one at most) is a
// solution.  Furthermore, it has a symmetric solution.
          backtrack[N-2] = backtrack[N-3];
          backtrack[N-2].place_queen( N-2, j[N-2] );

          if ( backtrack[N-2].first_safe_space(N-1) < 1 << N )
            solutions += 2;

          j[N-2] = backtrack[N-3].next_safe_space( N-2, j[N-2] );
        }
        else
        {
// Descend.
          backtrack[current_row] = backtrack[current_row - 1];
          backtrack[current_row].place_queen(current_row, j[current_row] );

          ++current_row;
          j[current_row] =
backtrack[current_row-1].first_safe_space(current_row);
        } // end if
      } // end while
    } // end for

// If N is odd, we also need to check the middle column.  The two halves of
// the board on either side are symmetric, so we need only check one.
    if ( 1 == N % 2 )
    {
#ifndef NDEBUG
      j[0] = 1 << (N/2);
#endif

      backtrack[0].clear();
      backtrack[0].place_queen( 0, 1 << (N/2) );

// The second queen cannot be in the position diagonally below the first.
      for ( j[1] = 1; j[1] < 1 << (N/2 - 1); j[1] <<= 1 )
      {
        backtrack[1] = backtrack[0];
        backtrack[1].place_queen( 1, j[1] );

        size_t current_row = 2;
        j[2] = backtrack[1].first_safe_space(2);

        while (true)
        {
          if ( 1 << N <= j[current_row] )
          {
// There are no (more) possible matches on this row.
            if ( 2 == current_row )
              break;

            --current_row;

            j[current_row] =
backtrack[current_row-1].next_safe_space( current_row, j[current_row] );
          }
          else if ( N - 2 == current_row )
          {
// Descend to the final level.
//
// Any open spaces we find here (there should be one at most) is a
// solution.  Furthermore, it has a symmetric solution.

            backtrack[N-2] = backtrack[N-3];
            backtrack[N-2].place_queen( N-2, j[N-2] );

            if ( backtrack[N-2].first_safe_space(N-1) < 1 << N )
              solutions += 2;

            j[N-2] = backtrack[N-3].next_safe_space( N-2, j[N-2] );
          }
          else
          {
// Descend.
            backtrack[current_row] = backtrack[current_row - 1];
            backtrack[current_row].place_queen(current_row, j[current_row] );

            ++current_row;
            j[current_row] =
backtrack[current_row-1].first_safe_space(current_row);
          } // end if
        } // end while
      } // end for
    } // end if

    board<N>::destroy_attack_matrix();

    return solutions;
  }

// For the degenerate case N == 1, we need a trivial handler:
template <>
  long solve_queens<1> (void)
  {
    return 1;
  }

long queens( unsigned n )
{
  switch (n)
  {
    case  1: return solve_queens<1> (); break;
    case  2: return solve_queens<2> (); break;
    case  3: return solve_queens<3> (); break;
    case  4: return solve_queens<4> (); break;
    case  5: return solve_queens<5> (); break;
    case  6: return solve_queens<6> (); break;
    case  7: return solve_queens<7> (); break;
    case  8: return solve_queens<8> (); break;
    case  9: return solve_queens<9> (); break;
    case 10: return solve_queens<10>(); break;
    case 11: return solve_queens<11>(); break;
    case 12: return solve_queens<12>(); break;
    case 13: return solve_queens<13>(); break;
    case 14: return solve_queens<14>(); break;
    case 15: return solve_queens<15>(); break;
    case 16: return solve_queens<16>(); break;
    case 17: return solve_queens<17>(); break;

    default: return -1;
// If you're getting -1, expand the switch block.
  }
}

int main(void)
{
  static const unsigned start = 1;
//  static const unsigned finish = 17;
  static const unsigned finish = 14;
  clock_t begin, end;
  clock_t total = 0;

  unsigned i;

  for ( i = start; i <= finish; ++i )
  {
    long answer;

    begin = clock();
    answer = queens(i);
    end = clock();

    total += end - begin;

    cout << setw(8);
    cout << static_cast<double>( 1000 * (end - begin) ) / CLOCKS_PER_SEC;
    cout << " ms # of solutions for ";
    cout << setw(2) << i << " x ";
    cout << setw(2) << i << " = ";
    cout << answer << endl;
  }

  cout << "Total time: ";
  cout << static_cast<double>( 1000 * total ) / CLOCKS_PER_SEC;
  cout << " ms." << endl;

  return EXIT_SUCCESS;
}

The major difference is that it uses arrays rather than vectors within the board objects that form the arrays for backtracking.

Here are my results. I don't suggest going as high as 17 unless you have a lot of time on your hands.

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
      10 ms # of solutions for 11 x 11 = 2680
      60 ms # of solutions for 12 x 12 = 14200
     330 ms # of solutions for 13 x 13 = 73712
    1990 ms # of solutions for 14 x 14 = 365596
   12380 ms # of solutions for 15 x 15 = 2279184
   87110 ms # of solutions for 16 x 16 = 14772512
  631960 ms # of solutions for 17 x 17 = 95815104
Total time: 733840 ms.

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

For reference, here are my results with Frenchfrog's benchmark code:

Code:

         0 # of solutions for 2 x 2 = 0
         0 # of solutions for 3 x 3 = 0
         0 # of solutions for 4 x 4 = 2
         0 # of solutions for 5 x 5 = 10
         0 # of solutions for 6 x 6 = 4
         0 # of solutions for 7 x 7 = 40
         0 # of solutions for 8 x 8 = 92
         0 # of solutions for 9 x 9 = 352
         0 # of solutions for 10 x 10 = 724
        10 # of solutions for 11 x 11 = 2680
        50 # of solutions for 12 x 12 = 14200
       280 # of solutions for 13 x 13 = 73712
      1650 # of solutions for 14 x 14 = 365596
     11770 # of solutions for 15 x 15 = 2279184

Press a key

frenchfrog · **Joined:**Sun Jun 18, 2006 3:06 pm **Posts:**69

I like your use of templates to put N as a literal value.

frenchfrog · **Joined:**Sun Jun 18, 2006 3:06 pm **Posts:**69

Just for fun, 33% faster code.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <limits.h>

const int n = 15;

unsigned long nQueens2Medium( int n, int x0 )
{
	int row[sizeof(unsigned long) * 8];
	unsigned long rowMask[sizeof(unsigned long) * 8];
	unsigned long rowMaskRight[sizeof(unsigned long) * 8];
	unsigned long rowMaskLeft[sizeof(unsigned long) * 8];

	unsigned long nbSol = 0;
	int y = 2;
	int xstart = 0;

	row[0] = x0;

	unsigned long maskCol = 1UL << x0;

	rowMaskLeft[1] =  ( 1UL << ( x0 - 1 ) ) & ( ULONG_MAX >> 1UL );
	rowMaskRight[1] = ( 1UL << ( x0 + 1 ) ) & ( ULONG_MAX - 1UL );

	rowMask[1] = rowMaskLeft[1] | rowMaskRight[1] | maskCol;

	while ( y >= 2 )
	{
		for ( int x = xstart; x < n; x++ )
		{
			if ( !( rowMask[y - 1] & ( 1UL << ( row[y - 1] = x ) ) ) )
			{
				if ( y < n )
				{
					maskCol |= 1UL << x;

					rowMaskLeft[y] = ( rowMaskLeft[y - 1] >> 1UL ) | ( 1UL << ( row[y - 1] - 1 ) );
					rowMaskLeft[y] &= ( ULONG_MAX >> 1UL );

					rowMaskRight[y] = ( rowMaskRight[y - 1] << 1UL ) | ( 1UL << ( row[y - 1] + 1 ) );
					rowMaskRight[y] &= ( ULONG_MAX - 1UL );

					rowMask[y] = rowMaskLeft[y] | rowMaskRight[y] | maskCol;

					y++;

					x = -1; // to start the next for at 0
				}
				else
				{
					nbSol++;
					break;
				}
			}
		}
		y--;
		xstart = row[y - 1] + 1;
		maskCol ^= 1UL << row[y - 1];
	}

	return nbSol;
}

inline bool nQueens2LargeSafe( int *row, int x, int y )
{
	for ( int i = 1; i <= y; i++ )
	{
		int rowVal = row[y - i];
		if ( rowVal == x || rowVal == x - i || rowVal == x + i )
			return false;
	}
	return true;
}

unsigned long nQueens2Large( int n, int x0 )
{
	int *row = new int[n];

	unsigned long nbSol = 0;
	int y = 2;
	int xstart = 0;

	row[0] = x0;

	while ( y >= 2 )
	{
		for ( int x = xstart; x < n; x++ )
		{
			if ( nQueens2LargeSafe( row, row[y - 1] = x, y - 1 ) )
			{
				if ( y < n )
				{
					y++;
					x = -1; // to start the next for at 0
				}
				else
				{
					nbSol++;
					break;
				}
			}
		}
		y--;
		xstart = row[y - 1] + 1;
	}

	delete [] row;

	return nbSol;
}

unsigned long nQueens( int n )
{
	if ( n == 1 ) return 1;

	unsigned long nbSolLeft = 0, nbSolMid = 0;
	int limit = ( n - 1 ) >> 1;
	for ( int x = 0; x <= limit; x++ )
	{
		unsigned long nbSolTemp;
		if ( sizeof(unsigned long) * 8 >= n )
			nbSolTemp = nQueens2Medium( n, x );
		else
			nbSolTemp = nQueens2Large( n, x );

		if ( x == limit && ( n & 1 ) )
			nbSolMid += nbSolTemp;
		else
			nbSolLeft += nbSolTemp;
	}

	return ( nbSolLeft << 1 ) + nbSolMid;
}

int main()
{
	for ( int i = 1; i <= n; i++ )
	{
		clock_t init = clock();
		unsigned long nbSol = nQueens( i );
		clock_t final = clock();

		printf( "%10u # of solutions for %d x %d = %u\n", (unsigned int)( (final - init) * 1000 / CLOCKS_PER_SEC ), i, i, nbSol );
	}
	return 0;
}

Code:

         0 # of solutions for 1 x 1 = 1
         0 # of solutions for 2 x 2 = 0
         0 # of solutions for 3 x 3 = 0
         0 # of solutions for 4 x 4 = 2
         0 # of solutions for 5 x 5 = 10
         0 # of solutions for 6 x 6 = 4
         0 # of solutions for 7 x 7 = 40
         0 # of solutions for 8 x 8 = 92
         0 # of solutions for 9 x 9 = 352
         0 # of solutions for 10 x 10 = 724
         0 # of solutions for 11 x 11 = 2680
        16 # of solutions for 12 x 12 = 14200
       109 # of solutions for 13 x 13 = 73712
       609 # of solutions for 14 x 14 = 365596
      4188 # of solutions for 15 x 15 = 2279184
     26672 # of solutions for 16 x 16 = 14772512
    205250 # of solutions for 17 x 17 = 95815104

zig · PostPosted:Tue Aug 28, 2007 8:09 pm

My solution can only find distinct solutions, and it's pretty slow. I hope to make up some of the points on style and so forth, though.

The algorithm creates an array of unsigned longs. The index of each one is the Y value, and the value in each one is the X value. The recursive placement function makes sure that no queens are placed on the same column or row, which leaves only diagonal testing to be done by the validation function.

Code:

unsigned long solutions, size;
unsigned long *rows;

// Checks solutions for diagonal collisions
bool validate()
{
	for (unsigned int i = 0; i < size; i++)
	{
		for (unsigned int j = i + 1; j < size; j++)
		{
			// Basic X-Y diagonal stuff
			if (i + rows[i] == j + rows[j] || i - rows[i] == j - rows[j])
			{
				return false;
			}
		}
	}
	
	return true;
}

// Recursively places X-values into the appropriate Y-valued row
void position(unsigned long qidx)
{
	for (unsigned long i = 0; i < size; i++)
	{
		rows[qidx] = i;
		
		for (unsigned long j = 0; j < qidx; j++)
		{
			// Eliminate coinciding X-values
			if (rows[j] == i)
			{
				// I need to 'continue', but from the outer loop, not this one
				goto skip;
			}
		}

		if (qidx < size - 1)
		{
			// Recurse if not at the last recursion
			position(qidx + 1);
		}
		else
		{
			// Validate if at the last recursion
			if (validate())
			{
				solutions++;
			}
		}
		
	// Quasi-continue label
	skip:;
	}
}

// This is what gets called externally
unsigned long nqueens(unsigned long _size)
{
	solutions = 0;
	size = _size;
	
	// Create an array of precisely the right size so as to not waste memory
	rows = new unsigned long[size];
	
	// Call the first iteration of the recursive positioning function
	position(0);
	
	// No memory leaks!
	delete rows;
	
	return solutions;
}

At least my code is relatively short

zig · PostPosted:Tue Aug 28, 2007 8:27 pm

After 20 minutes of tweaking, I made it a ton faster by eliminating the validation function and essentially inlining the diagonal validation into the positioning loop. Not only does it appear to be faster than Tycho's algorithm, it's also shorter than the previous version of my algorithm.

Code:

unsigned long solutions, size;
unsigned long *rows;

// Recursively places X-values into the appropriate Y-valued row
void position(unsigned long qidx)
{
	for (unsigned long i = 0; i < size; i++)
	{
		rows[qidx] = i;
		
		for (unsigned long j = 0; j < qidx; j++)
		{
			// Eliminate coinciding X-values by column and diagonal tests
			if (rows[j] == i || qidx + rows[qidx] == j + rows[j] || qidx - rows[qidx] == j - rows[j])
			{
				// Excuse for goto: I need to 'continue', but from the outer loop, not this one
				goto skip;
			}
		}
		
		if (qidx < size - 1)
		{
			// Recurse if not at the last recursion
			position(qidx + 1);
		}
		else
		{
			// If it got here, it's a solution
			solutions++;
		}
		
	// Quasi-continue label
	skip:;
	}
}

// This is what gets called externally
unsigned long nqueens(unsigned long _size)
{
	solutions = 0;
	size = _size;
	
	// Create an array of precisely the right size so as to not waste memory
	rows = new unsigned long[size];
	
	// Call the first iteration of the recursive positioning function
	position(0);
	
	// No memory leaks!
	delete [] rows;
	
	return solutions;
}

EDIT: Fixed a small memory leak issue (i.e. I was an idiot). Should have no effect on performance or result.

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

Nice work, Frenchfrog! After some tweaking, I found that g++ generates better code when I internally represent the rows of the board with the bits wrapped in a hand-rolled structure, compared to using std::bitset, or just integers (either of the machine's native size or the minimal size). I'm not sure why. With this optimization, the code catches up to your previous submission when N = 17.

It's still catching up to your optimized program (yours is still 37% faster for N = 17, but it was 47% faster for N = 14 and has been declining since then). Computing N = 18 takes hours, and N = 19 days, so I'm not sure where the catch-up point is now.

Edit: Never mind! I found the problem that was slowing the program down so much! It keeps a lookup table of the lower-row positions
any queen attacks. Because of a logic error, it was copying the entries to a dynamically-allocated list in linear time. Now that it performs lookups in constant time, as it should, it's much faster.

Edit: Tested with Microsoft Visual C++, which is eight years behind the language standard. If you want to write Microsoft about this, be sure to mention: don't worry about the Y2K bug; sell stocks and buy real estate; and no matter how much you may dislike Bill Clinton, do not under any circumstances vote for George H. W. Bush's son.

This latest version compiles and runs correctly under MSVC in Win32 32-bit mode, much more slowly than under Linux x86_64. It also fixes a couple bugs that happened not to affect g++.

Edit: Tweaked the algorithm to eliminate the middle-column test for odd values of N. This cut the size of solve_queens() in half.

Edit: Isolated the parts of the code that use C99 extensions, and added feature test macros to enable or disable them.

Edit: Some final clean-up, to remove outdated comments and redundant code blocks.

Edit: Did I say, "final?" I meant, of course, "Microsoft's compiler crashes on preprocessor warnings. Adding insult to injury, it doesn't even print the warning."

Here are my current tweaks:

Code:

#define NDEBUG 1
#define _XOPEN_SOURCE 600

#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <iomanip>
#include <vector>

#if ( __GNUC__ >= 4 ) || ( _STDC_VERSION__ >= 199901L )
#define USE_C99_EXTENSIONS 1
#endif

#ifdef USE_C99_EXTENSIONS

#include <stdint.h>
#include <math.h>

#else

#include <cmath>

typedef unsigned long uint_fast32_t;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned long uint32_t;

using std::frexp;

inline int ilogb ( double x )
{
  int result;

  frexp( x, & result );

  return --result;
}

#endif

using std::clock;
using std::cout;
using std::endl;
using std::ostream;
using std::setw;
using std::transform;
using std::vector;

typedef uint_fast32_t uint_fast;

/* This template contains a generic implementation of a N-queens algorithm.
 * The compiler should be able to optimize for speed (at the expense of code
 * size, which is far less important) based on the fact that N is a constant.
 * For efficiency, we can plug in hand-optimized implementations for any
 * value of N.
 *
 * The algorithm assumes that an unsigned long has at least N bits.  Given
 * that it has at least 32 bits, and any computer capable of solving the
 * problem for N > 32 in any feasible amount of time will have at least a 64-
 * bit architecture, this seems safe.
 *
 * The largest problem with this program was its choice of data types; the
 * current solution is to make bitfield a template parameter.  It should be
 * an unsigned integral type, although you might see whether unsigned long
 * is faster on your architecture.
 */

template < typename T >
  class bitwise_or
  {
    // A binary functor, intended for bitfields.
    public:

    inline T operator() ( const T& a, const T& b )
    {
      return (a | b);
    }
  };

template < unsigned int N, typename bitfield >
  class board
  {
/* This object represents a N * N chessboard.  Internally, it maintains a
 * vector of bitfields, each of size N.  A modern compiler should be able to
 * vectorize operations on these data.  On most implementations, this should
 * fit each bitfield into the smallest appropriate integer type, permitting
 * a SIMD instruction such as POR.  This is probably not the absolutely most
 * compact representation (e.g., the 5x5 board will probably need 5 bytes
 * instead of 4), but it should generate appropriately-aligned code for
 * speed.
 *
 * A 1 in the position representing any square means that a queen definitely
 * threatens that square.  A 0 means that we do not know of any queen which
 * threatens that square.
 *
 * The bit positions are stored in row-major order, with the square on the
 * upper left represented by squares[0][0].
 */
    public:

    class internal
    {
      public:

      inline internal(void) : bits(0U) {};
      inline internal( bitfield x ) : bits(x) {};
      inline operator uint_fast(void) const { return (uint_fast)bits; }
      inline const internal operator| ( const internal& x ) const
        { return internal ( bitfield (bits | x.bits) ); }

      private:

      bitfield bits;

      friend class board;
    };

    typedef vector< vector< internal > > attack_matrix_t;

// We should not need to override the implicit member functions.

    void clear(void);
// Wipes all queens off the board;
    void place_queen( uint_fast row, uint_fast column_bit );
// Place a queen, updating the board as to which squares it attacks.
    uint_fast first_safe_space( uint_fast row ) const;
// Returns the first column in the given row not under attack, or
// a number out of range if no such column exists.
    uint_fast next_safe_space ( uint_fast row,
                                uint_fast column_bit
                              ) const;
/* Returns either the first column to the right of the specified column
 * that is not under attack, or a number out of range if no such column
 * exists.
 */
    static void initialize_attack_matrix(void);
    static void destroy_attack_matrix(void);

    private:

    static attack_matrix_t attack_matrix;

    internal squares[N];

// Stores N * N bits representing the positions under attack on a N * N
// chessboard.
  };

template < unsigned int N, typename bitfield >
   typename board<N, bitfield>::attack_matrix_t board< N, bitfield >::attack_matrix;

/* Stores bitfields representing positions attacked by a queen in a given
 * square.  The place_queen() member function fills the individual matrix
 * entries in as needed.
 *
 * A queen in row i, column j corresponds to the vector in
 * attack_matrix[ i*N + j ].  There are no entries for the last row, N-1,
 * because they should never be needed.  Only the bitfields representing rows
 * i+1 through N-1 are stored.
 */

template < unsigned int N, typename bitfield >
  void board< N, bitfield >::clear( void )
// Wipes all queens off the board.
{
  for ( size_t i = 0; i < N; ++i )
    squares[i].bits = 0;
}

template < unsigned int N, typename bitfield >
  void board< N, bitfield >::place_queen( uint_fast row,
                                          uint_fast column_bit
                                        )
// Place a queen on the board, updating all rows below row to show which
// squares are under attack.
  {
/* One annoyance of representing column by a bitfield is that, to convert it
 * back to an index, we use floating-point math.  This is still a constant-
 * time operation on modern computers.  Furthermore, finding the next column
 * to search is also a constant-time operation.  Therefore, this method
 * beats linear search for large N.
 *
 * We aren't using large N, so whether this is really faster is doubtful.
 */
    const uint_fast column =
static_cast<uint_fast> ( ilogb( static_cast<double> (column_bit) ) );
    const ptrdiff_t i = N * row + column;
    static const bitwise_or<internal> or_functor = bitwise_or<internal>();

// If the test run passes all these sanity checks, it should be safe to
// recompile with NDEBUG, and run without them.
    assert( row >= 0 );
    assert( row < N - 1 );
    assert( column >= 0 );
    assert( column < N );

// If the requested square is already in the database, don't enter it again.
    if ( attack_matrix[i].empty() )
    {
      vector< internal > attacks;

      attacks.reserve( static_cast<size_t>(N - 1 - row) );

// Find the highest square below us that is in the matrix, then fill upwards
// from there.
      ptrdiff_t j;
      uint_fast shift = 0;

      for ( j = i + N; j < N * N - N; j += N )
        if ( ! attack_matrix[j].empty() )
          {
            attacks = attack_matrix[j];
            shift = static_cast<uint_fast> (attacks.size());

            break;
          }

// If we fell through the loop, we are on the final row.  If we didn't, we
// are on the last filled square.  In either case, we want to move one square
// up.

/* Starting with the second-to-last row, fill in the attack matrix for the
 * square in that row directly beneath the one we want.
 */
     for ( j -= N; j >= i; j -= N )
     {
       ++shift;

       const uint_fast bottom_row ( column_bit |
                                    column_bit << shift |
                                    column_bit >> shift
                                  );

       attacks.push_back( internal(static_cast<bitfield>(bottom_row) ) );
       attack_matrix[j] = attacks;
     }

// Now that we have the attack matrix, AND it with the board.

     transform ( (&squares[0]) + 1 + row,
                 (&squares[0]) + N,
                 &attacks[0],
                 (&squares[0]) + 1 + row,
                 or_functor
               );

      assert ( attacks.size() == N - 1 - row; );
    } // end if ( attack_matrix[i].empty() )
    else
    {
     transform ( (&squares[0]) + 1 + row,
                 (&squares[0]) + N,
                 &(attack_matrix[i][0]),
                 (&squares[0]) + 1 + row,
                 or_functor
               );
    }
  }

template < unsigned int N, typename bitfield >
  uint_fast board< N, bitfield >::first_safe_space( uint_fast row ) const
  {
    const uint_fast to_return = squares[row];

    return ( ~to_return & (to_return + 1) );
  }

template < unsigned int N, typename bitfield >
  uint_fast board< N, bitfield >::next_safe_space( uint_fast row,
                                                   uint_fast column_bit
                                                 ) const
  {
    uint_fast to_return = squares[row];

// Mask the columns up to column_bit.
    to_return |= column_bit;
    to_return |= column_bit - 1;

    return ( ~to_return & (to_return + 1) );
  }


template < unsigned int N, typename bitfield >
  void board< N, bitfield >::initialize_attack_matrix (void)
  {
    attack_matrix.resize( (N - 1) * N );
  }

template < unsigned int N, typename bitfield >
  void board< N, bitfield >::destroy_attack_matrix (void)
  {
    attack_matrix.clear();
  }
 

template < unsigned int N, typename bitfield >
  long solve_queens(void)
  {
    long solutions = 0;
    uint_fast j[N-1];
    board< N, bitfield > backtrack[N-1];

    board< N, bitfield >::initialize_attack_matrix();

/* Find the solutions where the queen on the first row is on the left half
 * of the board.  For each such solution, there is a one-to-one corresp-
 * ondence with the solutions where that same queen is on the right half
 * of the board, and the solutions are reflections of each other.
 *
 * If N is odd, any solution with the last queen in the center column will
 * have at least three counterparts: its horizontal reflection, its vertical
 * reflection, and its horizontal and vertical reflection.  Any solution with
 * the first queen in that column would be one of the last two.  Therefore,
 * we can eliminate the loop to test the center column.
 */
    for ( j[0] = 1; j[0] < 1 << (N >> 1); j[0] <<= 1 )
    {
      backtrack[0].clear();
      backtrack[0].place_queen( 0, j[0] );

      uint_fast current_row = 1;
      j[1] = backtrack[0].first_safe_space(1);

      while (true)
      {
        if ( 1 << N <= j[current_row] )
        {
// There are no (more) possible matches on this row.
          if ( 1 == current_row )
            break;

          --current_row;

          j[current_row] =
backtrack[current_row-1].next_safe_space( current_row, j[current_row] );
        }
        else if ( N - 2 == current_row )
        {
// Descend to the final level.
//
// Any open spaces we find here (there should be one at most) is a
// solution.  Furthermore, it has a symmetric solution.
          backtrack[N-2] = backtrack[N-3];
          backtrack[N-2].place_queen( N-2, j[N-2] );

          const uint_fast k = backtrack[N-2].first_safe_space(N-1);

          if ( k < 1 << N )
        {
          if ( N % 2 == 1 && k == 1 << (N >> 1) )
// Our last queen is in the center column.
            solutions += 4;
          else
            solutions += 2;
        }

          j[N-2] = backtrack[N-3].next_safe_space( N-2, j[N-2] );
        }
        else
        {
// Descend.
          backtrack[current_row] = backtrack[current_row - 1];
          backtrack[current_row].place_queen(current_row, j[current_row] );

          ++current_row;
          j[current_row] =
backtrack[current_row-1].first_safe_space(current_row);
        } // end if
      } // end while
    } // end for

    board< N, bitfield >::destroy_attack_matrix();

    return solutions;
  }

// For the degenerate case N <= 2, we need a trivial handler:
template < >
  long solve_queens< 1, uint8_t > (void)
  {
    return 1;
  }

template < >
  long solve_queens< 2, uint8_t > (void)
  {
    return 0;
  }

long queens( unsigned n )
{
  switch (n)
  {
    case  1: return solve_queens<  1, uint8_t  > (); break;
    case  2: return solve_queens<  2, uint8_t  > (); break;
    case  3: return solve_queens<  3, uint8_t  > (); break;
    case  4: return solve_queens<  4, uint8_t  > (); break;
    case  5: return solve_queens<  5, uint8_t  > (); break;
    case  6: return solve_queens<  6, uint8_t  > (); break;
    case  7: return solve_queens<  7, uint8_t  > (); break;
    case  8: return solve_queens<  8, uint8_t  > (); break;
    case  9: return solve_queens<  9, uint16_t > (); break;
    case 10: return solve_queens< 10, uint16_t > (); break;
    case 11: return solve_queens< 11, uint16_t > (); break;
    case 12: return solve_queens< 12, uint16_t > (); break;
    case 13: return solve_queens< 13, uint16_t > (); break;
    case 14: return solve_queens< 14, uint16_t > (); break;
    case 15: return solve_queens< 15, uint16_t > (); break;
    case 16: return solve_queens< 16, uint16_t > (); break;
    case 17: return solve_queens< 17, uint32_t > (); break;
    case 18: return solve_queens< 18, uint32_t > (); break;

    default: return -1;
// If you're getting -1, expand the switch block.
  }
}

int main(void)
{
  static const unsigned start = 1;
//  static const unsigned finish = 14;
  static const unsigned finish = 17;

  clock_t begin, end;
  clock_t total = 0;

  unsigned i;

  for ( i = start; i <= finish; ++i )
  {
    long answer;

    begin = clock();
    answer = queens(i);
    end = clock();

    total += end - begin;

    cout << setw(8);
    cout << static_cast<double>( 1000 * (end - begin) ) / CLOCKS_PER_SEC;
    cout << " ms # of solutions for ";
    cout << setw(2) << i << " x ";
    cout << setw(2) << i << " = ";
    cout << answer << endl;
  }

  cout << "Total time: ";
  cout << static_cast<double>( 1000 * total ) / CLOCKS_PER_SEC;
  cout << " ms." << endl;

  return EXIT_SUCCESS;
}

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

For fun, here is the fastest program I've ever written to print the solutions to the N-queens puzzle up to N = 25:

Code:

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
  puts("1\n0\n0\n2\n10\n4\n40\n92\n352\n724\n2680\n14200\n73712\n"
         "365596\n2279184\n14772512\n95815104\n666090624\n"
         "4968057848\n39029188884\n314666222712\n"
         "2691008701644\n24233937684440\n"
         "227514171973736\n2207893435808352\n"
        );

  return EXIT_SUCCESS;
}

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

The results of my latest benchmark:

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
       0 ms # of solutions for 11 x 11 = 2680
      30 ms # of solutions for 12 x 12 = 14200
     120 ms # of solutions for 13 x 13 = 73712
     770 ms # of solutions for 14 x 14 = 365596
    4530 ms # of solutions for 15 x 15 = 2279184
   30110 ms # of solutions for 16 x 16 = 14772512
  218830 ms # of solutions for 17 x 17 = 95815104
Total time: 254390 ms.

Edit: Updated to the current version of the program, posted above. (My real program, of course, not the joke.)

And with Frenchfrog's latest program:

Code:

         0 # of solutions for 1 x 1 = 1
         0 # of solutions for 2 x 2 = 0
         0 # of solutions for 3 x 3 = 0
         0 # of solutions for 4 x 4 = 2
         0 # of solutions for 5 x 5 = 10
         0 # of solutions for 6 x 6 = 4
         0 # of solutions for 7 x 7 = 40
         0 # of solutions for 8 x 8 = 92
         0 # of solutions for 9 x 9 = 352
         0 # of solutions for 10 x 10 = 724
         0 # of solutions for 11 x 11 = 2680
        30 # of solutions for 12 x 12 = 14200
       180 # of solutions for 13 x 13 = 73712
      1020 # of solutions for 14 x 14 = 365596
      7200 # of solutions for 15 x 15 = 2279184
     47040 # of solutions for 16 x 16 = 14772512
    369090 # of solutions for 17 x 17 = 95815104

And here's Zig's. I re-used the timing loop from my program to call his. I'm afraid I didn't wait for it to compute the solution for N = 17.

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
      10 ms # of solutions for 10 x 10 = 724
      40 ms # of solutions for 11 x 11 = 2680
     250 ms # of solutions for 12 x 12 = 14200
    1540 ms # of solutions for 13 x 13 = 73712
   10210 ms # of solutions for 14 x 14 = 365596
   70820 ms # of solutions for 15 x 15 = 2279184
  518730 ms # of solutions for 16 x 16 = 14772512

Edit: Last time you tried replicating my benchmarks, Frenchfrog, we were surprised by how much slower my program ran on your machine. Here is what I get when I reboot the same computer into 32-bit Windows XP and run the same benchmark:

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
      16 ms # of solutions for 11 x 11 = 2680
      78 ms # of solutions for 12 x 12 = 14200
     359 ms # of solutions for 13 x 13 = 73712
    2438 ms # of solutions for 14 x 14 = 365596
   14094 ms # of solutions for 15 x 15 = 2279184
  102125 ms # of solutions for 16 x 16 = 14772512
  665344 ms # of solutions for 17 x 17 = 95815104
Total time: 784454 ms.
Press any key to continue . . .

I don't think that this is merely glibc misreporting the value of clock(). And I only bash Microsoft when the company deserves it: almost none of that time was spent executing OS code, so the issue also isn't Windows being inefficient. The same program runs close to three times faster on 64-bit Linux. A lot of that speed is probably due to the x86_64 architecture, but g++ also seems to be doing a better job of optimizing my program.

Edit: Tested that hypothesis by compiling in g++ 3.4.4 in Cygwin, and got:

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
      16 ms # of solutions for 10 x 10 = 724
      31 ms # of solutions for 11 x 11 = 2680
     172 ms # of solutions for 12 x 12 = 14200
     891 ms # of solutions for 13 x 13 = 73712
    5656 ms # of solutions for 14 x 14 = 365596
   33297 ms # of solutions for 15 x 15 = 2279184
  234890 ms # of solutions for 16 x 16 = 14772512

frenchfrog · **Joined:**Sun Jun 18, 2006 3:06 pm **Posts:**69

First, thanks for providing all those numbers

In clear?

1) 32 bit ?MS Studio 2005? is ~2x faster than 32 bit gcc 3.4.4 in this _benchmark_.
2) 64 bit gcc is 3x faster than 32 bit ?MS Studio 2005? in this _benchmark_.

Which version of gcc do you use on Linux?

Does compiling with that version of gcc on Linux to a 32 bit target change the numbers compare to g++ 3.4.4 in Cygwin/Windows?

EDIT: I would also like to say that I like your first_safe_space/next_safe_space functions, saw them in your first code version and it's brillant

EDIT EDIT: My code 3 times faster stealing your first_safe_space/next_safe_space functions:

EDIT EDIT EDIT: I guess what's make my code faster is that it doesn't need ilogb, it works directly with _bitfields_.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <limits.h>

// 2 functions from Lorehead code.
unsigned long firstSafeSpace( unsigned long *rowMask, int y )
{
	unsigned long to_return = rowMask[y - 1];

	return ( ~to_return & (to_return + 1) );
}

unsigned long nextSafeSpace( unsigned long *rowMask, int y, unsigned long x )
{
	unsigned long to_return = rowMask[y - 1];

	to_return |= x;
	to_return |= x - 1;

	return ( ~to_return & (to_return + 1) );
}

unsigned long nQueens2Medium( int n, unsigned long x0 )
{
	unsigned long row[sizeof(unsigned long) * 8];
	unsigned long rowMask[sizeof(unsigned long) * 8];
	unsigned long rowMaskRight[sizeof(unsigned long) * 8];
	unsigned long rowMaskLeft[sizeof(unsigned long) * 8];

	unsigned long nbit = 1UL << ( n - 1 );

	unsigned long nbSol = 0;
	unsigned long maskCol = x0;

	int y = 2;

	row[0] = x0;

	rowMaskLeft[1] =  x0 >> 1;
	rowMaskRight[1] = x0 << 1;

	rowMask[1] = rowMaskLeft[1] | rowMaskRight[1] | maskCol;

	unsigned long xstart = firstSafeSpace( rowMask, y );

	while ( y >= 2 )
	{
		for ( unsigned long x = xstart; x <= nbit; )
		{
			if ( !( rowMask[y - 1] & ( row[y - 1] = x ) ) )
			{
				if ( y < n )
				{
					maskCol |= x;

					rowMaskLeft[y] = ( rowMaskLeft[y - 1] >> 1 ) | ( x >> 1 );

					rowMaskRight[y] = ( rowMaskRight[y - 1] << 1 ) | ( x << 1 );

					rowMask[y] = rowMaskLeft[y] | rowMaskRight[y] | maskCol;

					y++;

					x = firstSafeSpace( rowMask, y ); // to start the next for at 0
					continue;
				}
				else
				{
					nbSol++;
					break;
				}
			}
			x = nextSafeSpace( rowMask, y, x );
		}
		y--;
		xstart = nextSafeSpace( rowMask, y, row[y - 1] );
		maskCol ^= row[y - 1];
	}

	return nbSol;
}

inline bool nQueens2LargeSafe( int *row, int x, int y )
{
	for ( int i = 1; i <= y; i++ )
	{
		int rowVal = row[y - i];
		if ( rowVal == x || rowVal == x - i || rowVal == x + i )
			return false;
	}
	return true;
}

unsigned long nQueens2Large( int n, int x0 )
{
	int *row = new int[n];

	unsigned long nbSol = 0;
	int y = 2;
	int xstart = 0;

	row[0] = x0;

	while ( y >= 2 )
	{
		for ( int x = xstart; x < n; x++ )
		{
			if ( nQueens2LargeSafe( row, row[y - 1] = x, y - 1 ) )
			{
				if ( y < n )
				{
					y++;
					x = -1; // to start the next for at 0
				}
				else
				{
					nbSol++;
					break;
				}
			}
		}
		y--;
		xstart = row[y - 1] + 1;
	}

	delete [] row;

	return nbSol;
}

unsigned long nQueens( int n )
{
	if ( n == 1 ) return 1;

	unsigned long nbSolLeft = 0, nbSolMid = 0;
	int limit = ( n - 1 ) >> 1;
	for ( int x = 0; x <= limit; x++ )
	{
		unsigned long nbSolTemp;
		if ( sizeof(unsigned long) * 8 > n )
			nbSolTemp = nQueens2Medium( n, 1UL << x );
		else
			nbSolTemp = nQueens2Large( n, x );

		if ( x == limit && ( n & 1 ) )
			nbSolMid += nbSolTemp;
		else
			nbSolLeft += nbSolTemp;
	}

	return ( nbSolLeft << 1 ) + nbSolMid;
}

int main()
{
	const int n = 17;
	for ( int i = 1; i <= n; i++ )
	{
		clock_t init = clock();
		unsigned long nbSol = nQueens( i );
		clock_t final = clock();

		printf( "%10u # of solutions for %d x %d = %u\n", (unsigned int)( (final - init) * 1000 / CLOCKS_PER_SEC ), i, i, nbSol );
	}
	return 0;
}

Code:

         0 # of solutions for 1 x 1 = 1
         0 # of solutions for 2 x 2 = 0
         0 # of solutions for 3 x 3 = 0
         0 # of solutions for 4 x 4 = 2
         0 # of solutions for 5 x 5 = 10
         0 # of solutions for 6 x 6 = 4
         0 # of solutions for 7 x 7 = 40
         0 # of solutions for 8 x 8 = 92
         0 # of solutions for 9 x 9 = 352
         0 # of solutions for 10 x 10 = 724
         0 # of solutions for 11 x 11 = 2680
         0 # of solutions for 12 x 12 = 14200
        47 # of solutions for 13 x 13 = 73712
       203 # of solutions for 14 x 14 = 365596
      1329 # of solutions for 15 x 15 = 2279184
      8296 # of solutions for 16 x 16 = 14772512
     62094 # of solutions for 17 x 17 = 95815104

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

I'll see your loop and raise you one perfect hash function!

Edit: And replace the attack matrix with something static and sane. Apparently, that wasn't a huge bottlenect, but it did make a noticeable difference for higher values of N. Better refactor the loop.

Code:

#define NDEBUG 1
#define _XOPEN_SOURCE 600

#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <iomanip>
#include <vector>

#if ( __GNUC__ >= 4 ) || ( _STDC_VERSION__ >= 199901L )
#define USE_C99_EXTENSIONS 1
#endif

#ifdef USE_C99_EXTENSIONS

#include <stdint.h>

#else

typedef unsigned long uint_fast32_t;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned long uint32_t;

#endif

using std::clock;
using std::cout;
using std::endl;
using std::ostream;
using std::setw;
using std::transform;
using std::vector;

typedef uint_fast32_t uint_fast;

const unsigned start = 1;
// const unsigned finish = 14;
static const unsigned finish = 17;

/* I stand by my earlier comment: because ilogb is a constant-time operation,
 * it beats linear search in theory.  It's still too slow.  I need a better,
 * constant-time function.  I need a hash table.
 *
 * Unfortunately, hash tables are in Technical Report 1 of the ANSI/ISO C++
 * standard, which has yet to be formally adopted.  So, I have to roll my own.
 */

class hash_2n
{
/* This class represents a hash table whose keys are powers of 2, between 1
 * and 2**(finish-1), and whose data are the outputs of the function:
 * lg(key) + 1.  The lg notation comes from Donald Knuth.  (Aren't you glad I
 * dropped his name?)
 *
 * Anyhow.  This class is intended as a singleton, since there's no rational
 * motive to have more than one.  It does not have collision handling,
 * insertion, or more than minimal features. Such as it does have follow the
 * example of map.
 */
  public:

  hash_2n (void);
  inline uint_fast operator[] ( uint_fast exponent ) const;

  private:

  static const uint_fast hash_value = 37;
  static uint_fast table[hash_value];
};

hash_2n::hash_2n (void)
{
  for ( uint_fast i = 0; i < 17; ++i )
  {
	const uint_fast hash = ( 1 << i ) % hash_value;

    assert( 0 == table[hash] );
	table[hash] = i;
  }

  return;
}

inline uint_fast hash_2n::operator[] ( uint_fast exponent ) const
{
  return table[ exponent % hash_value ];
}

uint_fast hash_2n::table [hash_value];

const hash_2n column_hash;
// The one and only instance of this class in the program.

/* This template contains a generic implementation of a N-queens algorithm.
 * The compiler should be able to optimize for speed (at the expense of code
 * size, which is far less important) based on the fact that N is a constant.
 * For efficiency, we can plug in hand-optimized implementations for any
 * value of N.
 *
 * The algorithm assumes that an unsigned long has at least N bits.  Given
 * that it has at least 32 bits, and any computer capable of solving the
 * problem for N > 32 in any feasible amount of time will have at least a 64-
 * bit architecture, this seems safe.
 *
 * The largest problem with this program was its choice of data types; the
 * current solution is to make bitfield a template parameter.  It should be
 * an unsigned integral type, although you might see whether unsigned long
 * is faster on your architecture.
 */

template < typename T >
  class bitwise_or
  {
    // A binary functor, intended for bitfields.
    public:

    inline T operator() ( const T& a, const T& b )
    {
      return (a | b);
    }
  };

template < unsigned int N >
  class board
  {
/* This object represents a N * N chessboard.  Internally, it maintains a
 * vector of bitfields, each of size N.  A modern compiler should be able to
 * vectorize operations on these data.  On most implementations, this should
 * fit each bitfield into the smallest appropriate integer type, permitting
 * a SIMD instruction such as POR.  This is probably not the absolutely most
 * compact representation (e.g., the 5x5 board will probably need 5 bytes
 * instead of 4), but it should generate appropriately-aligned code for
 * speed.
 *
 * A 1 in the position representing any square means that a queen definitely
 * threatens that square.  A 0 means that we do not know of any queen which
 * threatens that square.
 *
 * The bit positions are stored in row-major order, with the square on the
 * upper left represented by squares[0][0].
 */
    public:
// We should not need to override the implicit member functions.

    void clear(void);
// Wipes all queens off the board;
    void place_queen( uint_fast row, uint_fast column_bit );
// Place a queen, updating the board as to which squares it attacks.
    uint_fast first_safe_space( uint_fast row ) const;
// Returns the first column in the given row not under attack, or
// a number out of range if no such column exists.
    uint_fast next_safe_space ( uint_fast row,
                                uint_fast column_bit
                              ) const;
/* Returns either the first column to the right of the specified column
 * that is not under attack, or a number out of range if no such column
 * exists.
 */

    private:

    uint32_t squares[N];
// Stores N * N bits representing the positions under attack on a N * N
// chessboard.
  };

//template < unsigned int N, typename bitfield >
//   typename board<N, bitfield>::attack_matrix_t board< N, bitfield >::attack_matrix;

// It's time to get serious.  The old algorithm just wasn't cutting it.

class attack_matrix_t
{
  public:

  attack_matrix_t (void);
  inline const uint_fast* operator[] ( size_t column ) const;

  private:

  static uint32_t the_matrix [finish][finish - 1];
/* Stores bitfields representing positions attacked by a queen in a given
 * column.  The entries are bitmasks to apply to the N - 1 squares below.
 */
};

uint_fast attack_matrix_t::the_matrix [finish][finish - 1];

attack_matrix_t::attack_matrix_t (void)
{
  for ( size_t i = 0; i < finish; ++i )
  {
    const uint_fast rook = 1 << i;

	for ( size_t j = 0; j < finish - 1; ++j )
	{
      const uint_fast kings_bishop = rook << (j + 1), queens_bishop = rook >> (j + 1);

	  the_matrix[i][j] = static_cast<uint32_t> ( rook | kings_bishop | queens_bishop );
	}
  }
}

inline const uint_fast* attack_matrix_t::operator[] ( size_t column ) const
{
  return (const uint_fast*) ( the_matrix + column );
}

const attack_matrix_t attack_matrix;
// The only instance of this object in the program.

template < unsigned int N >
  void board<N>::clear( void )
// Wipes all queens off the board.
{
  for ( size_t i = 0; i < N; ++i )
        squares[i] = 0;
}

template < unsigned int N >
  void board<N>::place_queen( uint_fast row, uint_fast column_bit )
// Place a queen on the board, updating all rows below row to show which
// squares are under attack.
  {
// Previously used ilogb(), which was non-portable and incredibly slow.
    const uint_fast column = column_hash[column_bit];
    static const bitwise_or<uint32_t> or_functor = bitwise_or<uint32_t>();

// If the test run passes all these sanity checks, it should be safe to
// recompile with NDEBUG, and run without them.
    assert( row >= 0 );
    assert( row < N - 1 );
    assert( column >= 0 );
    assert( column < N );

    transform ( (&squares[0]) + 1 + row,
                (&squares[0]) + N,
                attack_matrix[column],
                (&squares[0]) + 1 + row,
                or_functor
              );

	return;
  }

template < unsigned int N >
  uint_fast board<N>::first_safe_space( uint_fast row ) const
  {
    const uint_fast to_return = squares[row];

    return ( ~to_return & (to_return + 1) );
  }

template <unsigned int N>
  uint_fast board<N>::next_safe_space( uint_fast row,
                                       uint_fast column_bit
                                     ) const
  {
    uint_fast to_return = squares[row];

// Mask the columns up to column_bit.
    to_return |= column_bit;
    to_return |= column_bit - 1;

    return ( ~to_return & (to_return + 1) );
  }


template <unsigned int N>
  long solve_queens(void)
  {
    long solutions = 0;
    uint_fast j[N-1];
    board<N> backtrack[N-1];

// This loop may soon be refactored:
    for ( j[0] = 1; j[0] < 1 << (N >> 1); j[0] <<= 1 )
    {
      backtrack[0].clear();
      backtrack[0].place_queen( 0, j[0] );

      uint_fast current_row = 1;

      j[1] = backtrack[0].first_safe_space(1);

      while (true)
      {
        if ( 1 << N <= j[current_row] )
        {
// There are no (more) possible matches on this row.
          if ( 1 == current_row )
            break;

          --current_row;

          j[current_row] =
backtrack[current_row-1].next_safe_space( current_row, j[current_row] );
        }
        else if ( N - 2 == current_row )
        {
// Descend to the final level.
//
// Any open spaces we find here (there should be one at most) is a
// solution.  Furthermore, it has a symmetric solution.
          backtrack[N-2] = backtrack[N-3];
          backtrack[N-2].place_queen( N-2, j[N-2] );

          const uint_fast k = backtrack[N-2].first_safe_space(N-1);

          if ( k < 1 << N )
        {
          if ( N % 2 == 1 && k == 1 << (N >> 1) )
// Our last queen is in the center column.
            solutions += 4;
          else
            solutions += 2;
        }

          j[N-2] = backtrack[N-3].next_safe_space( N-2, j[N-2] );
        }
        else
        {
// Descend.
          backtrack[current_row] = backtrack[current_row - 1];
          backtrack[current_row].place_queen(current_row, j[current_row] );

          ++current_row;
          j[current_row] =
backtrack[current_row-1].first_safe_space(current_row);
        } // end if
      } // end while
    } // end for

    return solutions;
  }

template <>
  long solve_queens<1> (void)
  {
    return 1;
  }

template <>
  long solve_queens<2> (void)
  {
    return 0;
  }

long queens( unsigned n )
{
  switch (n)
  {
    case  1: return solve_queens<  1 > (); break;
    case  2: return solve_queens<  2 > (); break;
    case  3: return solve_queens<  3 > (); break;
    case  4: return solve_queens<  4 > (); break;
    case  5: return solve_queens<  5 > (); break;
    case  6: return solve_queens<  6 > (); break;
    case  7: return solve_queens<  7 > (); break;
    case  8: return solve_queens<  8 > (); break;
    case  9: return solve_queens<  9 > (); break;
    case 10: return solve_queens< 10 > (); break;
    case 11: return solve_queens< 11 > (); break;
    case 12: return solve_queens< 12 > (); break;
    case 13: return solve_queens< 13 > (); break;
    case 14: return solve_queens< 14 > (); break;
    case 15: return solve_queens< 15 > (); break;
    case 16: return solve_queens< 16 > (); break;
    case 17: return solve_queens< 17 > (); break;
    case 18: return solve_queens< 18 > (); break;

    default: return -1;
// If you're getting -1, expand the switch block.
  }
}

int main(void)
{
  clock_t begin, end;
  clock_t total = 0;

  unsigned i;

  for ( i = start; i <= finish; ++i )
  {
    long answer;

    begin = clock();
    answer = queens(i);
    end = clock();

    total += end - begin;

    cout << setw(8);
    cout << static_cast<double>( 1000 * (end - begin) ) / CLOCKS_PER_SEC;
    cout << " ms # of solutions for ";
    cout << setw(2) << i << " x ";
    cout << setw(2) << i << " = ";
    cout << answer << endl;
  }

  cout << "Total time: ";
  cout << static_cast<double>( 1000 * total ) / CLOCKS_PER_SEC;
  cout << " ms." << endl;

  return EXIT_SUCCESS;
}

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

New benchmarks, this time for Win32:

First, a run compiled with the default release settings. This is already running in half the time as before.

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
       0 ms # of solutions for 11 x 11 = 2680
      16 ms # of solutions for 12 x 12 = 14200
     172 ms # of solutions for 13 x 13 = 73712
     953 ms # of solutions for 14 x 14 = 365596
    5390 ms # of solutions for 15 x 15 = 2279184
   39515 ms # of solutions for 16 x 16 = 14772512
  263011 ms # of solutions for 17 x 17 = 95815104
Total time: 309057 ms.
Press any key to continue . . .

Now, to enable better optimizations in the project settings. (Don't worry, you'll get them, too.)

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
      15 ms # of solutions for 11 x 11 = 2680
      16 ms # of solutions for 12 x 12 = 14200
     125 ms # of solutions for 13 x 13 = 73712
     765 ms # of solutions for 14 x 14 = 365596
    4875 ms # of solutions for 15 x 15 = 2279184
   30906 ms # of solutions for 16 x 16 = 14772512
  255995 ms # of solutions for 17 x 17 = 95815104
Total time: 292697 ms.
Press any key to continue . . .

A 62% improvement in speed.

Frenchfrog's:

Code:

         0 # of solutions for 1 x 1 = 1
         0 # of solutions for 2 x 2 = 0
         0 # of solutions for 3 x 3 = 0
         0 # of solutions for 4 x 4 = 2
         0 # of solutions for 5 x 5 = 10
         0 # of solutions for 6 x 6 = 4
         0 # of solutions for 7 x 7 = 40
         0 # of solutions for 8 x 8 = 92
         0 # of solutions for 9 x 9 = 352
         0 # of solutions for 10 x 10 = 724
         0 # of solutions for 11 x 11 = 2680
         0 # of solutions for 12 x 12 = 14200
        46 # of solutions for 13 x 13 = 73712
       266 # of solutions for 14 x 14 = 365596
      1625 # of solutions for 15 x 15 = 2279184
     11125 # of solutions for 16 x 16 = 14772512
     77311 # of solutions for 17 x 17 = 95815104
Press any key to continue . . .

That's amazing. I wonder how mine will do on Linux, but you're way ahead on Windows.

Edit: Eliminating the overhead of the lookup table for attack positions brought me closer, but I'll still need to refactor the loop.

Code:

       0 ms # of solutions for  1 x  1 = 1
       0 ms # of solutions for  2 x  2 = 0
       0 ms # of solutions for  3 x  3 = 0
       0 ms # of solutions for  4 x  4 = 2
       0 ms # of solutions for  5 x  5 = 10
       0 ms # of solutions for  6 x  6 = 4
       0 ms # of solutions for  7 x  7 = 40
       0 ms # of solutions for  8 x  8 = 92
       0 ms # of solutions for  9 x  9 = 352
       0 ms # of solutions for 10 x 10 = 724
       0 ms # of solutions for 11 x 11 = 2680
      16 ms # of solutions for 12 x 12 = 14200
     125 ms # of solutions for 13 x 13 = 73712
     765 ms # of solutions for 14 x 14 = 365596
    4328 ms # of solutions for 15 x 15 = 2279184
   31734 ms # of solutions for 16 x 16 = 14772512
  213888 ms # of solutions for 17 x 17 = 95815104
Total time: 250856 ms.
Press any key to continue . . .

frenchfrog · **Joined:**Sun Jun 18, 2006 3:06 pm **Posts:**69

After a quick look at your code, perhaps transform in place_queen is doing too much work? There is no need to calcule immediately the attack matrix (1 + row) to N since a lot of impossible placement are detected before reaching N?

Lorehead · **Joined:**Thu Aug 09, 2007 9:09 am **Posts:**26

Quote:

After a quick look at your code, perhaps transform in place_queen is doing too much work? There is no need to calcule immediately the attack matrix (1 + row) to N since a lot of impossible placement are detected before reaching N?

What it should be doing, unless I misread the documentation tremendously, is starting with the row after the one we checked, and keep descending to the bottom of the board. The start iterator (a pointer into the array) points just past row in squares: it's the element squares[row + 1]. (The row variable is an index into the array.) Unless there's something I've overlooked.

A C-style loop did not speed up the code, so that doesn't seem to be a bottleneck. In theory, the semantics of transform should make it easier for a compiler to parallelize or vectorize that loop, and MSVC doesn't seem to be choking on it.

I have another idea for how to optimize the program, and it focuses on the loop. I'll be trying that first.

frenchfrog · **Joined:**Sun Jun 18, 2006 3:06 pm **Posts:**69

I wasn't clear enough, transform is fine, the problem is it's use, if I haven't misread your code, if it's placing the queen on a given row, transform will be called on all the rows from that current row plus 1 up to N. This means that if you backtrack before the last row some calculation will have been made for nothing. Is it a huge impact? I honestly don't know.

I would also like to point out the other brillant idea you had to eliminate the need to try the middle row on odd N

Challenge Two: N-Queens

Who is online