# Solving JigSaw Puzzle Using Neural Nets

## Can Neural Net solve a 2x2 Jigsaw Puzzle?

Training a Neural Net on **permutation invariant** data is difficult. And a jigsaw puzzle is one of those.

# What is Permutation Invariance?

A function is a permutation invariant if its output does not change by changing the ordering of its input. Here is an example below.

1) f(x,y,z) = ax + by +cz2) f(x,y,z) = xyz

The output of the 1st function will change if we change the input order, but the output of the 2nd function will not change. So 2nd function is permutation invariant.

Weights of the neural net maps to specific input units. When the input changes, the output will change too. So in order to learn this symmetry, weights should be such as that final output is unchanged even after changing the input. Which is not easy to learn by the feed-forward network.

A jigsaw puzzle is also permutation invariance. No matter what the ordering of puzzle pieces are the output would always be fixed. Here is an example of a `2x2`

grid puzzle, which we would be trying to solve in this project.

Solving a `3x3`

grid puzzle is extremely difficult. The following are possible combinations of these puzzles.

2x2 puzzle = 4! = 24 combinations3x3 puzzle = 9! = 362880 comb’ns

To solve a `3x3`

puzzle the network has to predict one correct combination out of `362880`

. This is one more reason why `3x3`

the puzzle is a tough one.

Let’s move forward and try to solve a `2x2`

Jigsaw puzzle.

# How did you get the data?

There was not any public dataset available for Jigsaw Puzzles, so I had to create it myself. I created the data as follows.

- Took a raw dataset containing around
`26K`

animal images. - Cropped all images into a fixed size of
`200x200`

. - Split the images into
`train`

,`test`

and`validation`

set. - Cut the images into 4 pieces and randomly rearranged them.
- For the training set, I have repeated the previous step 4 times to augment the data.
- Finally, we have 92K training images and 2K testing images. I have also separated out 300 images for validation.
- The label is an integer array that denotes the correct position of each puzzle piece.

This dataset contains both `2x2`

and `3x3`

puzzle. You can find it here.

# But how does the data look?

Following is a data sample of `2x2`

grid puzzle. Input is a `200x200`

pixel image and label is an array of 4 integers, where each integer tells the correct position of each piece.

Our goal is to feed this image into a neural net and get an output which is a vector of 4 integers indicating the correct position of each piece.

# How did you design the Network?

After trying more than 20 neural net architecture and a lot of trial and error I came up with an optimal design. Which is as follows.

- First, extract each puzzle piece from the image (total 4).
- Then pass each piece through the CNN. CNN extracts useful features and outputs a feature vector.
- We concatenate all 4 feature vectors into one using the Flatten layer.
- Then we pass this combined vector through a feed-forward network. The last layer of this network gives us a 16 unit long vector.
- We reshape this 16 unit vector into a matrix of
`4x4`

.

## Why do we reshape?

In a normal classification task, neural networks output a score for each class. We convert that score into probability by applying a softmax layer. The class which has the highest probability value is our predicted class. This is how we do classification.

The situation is different here. We want to classify each piece into its correct position `(0, 1, 2, 3)`

. And there are 4 such pieces. So we need 4 vectors(for each piece) each having 4 scores(for each position), which is nothing but a `4x4`

matrix. Where rows correspond to pieces and columns to score. Finally, we apply a softmax on this output matrix row-wise.

The following is the network diagram.

# How to code the Neural Net?

I am using Keras framework for this project. Following is the complete network implemented in the Keras. Which looks fairly simple.

# Can you explain the code?

As you see, the input shape is `(4,100,100,3)`

. Means I am feeding 4 images(puzzle pieces) of shape `(100,100,3)`

as an input to the network.

As you see, I am using **Time-Distributed**(TD) layers. TD layer applies a given layer multiple times over an input. Here the TD layer will apply the same convolutional layer over 4 input images *(line: 5, 9, 13, 17)*.

In order to use TD layers, we have to give one extra dimension in the input, over which TD layer applies a given layer multiple times. Here we are giving one extra dimension, which is the number of images. As a result, we get 4 feature vectors for all 4 image pieces.

Once the CNN feature extraction is done, we concatenate all the features using the **Flatten** layer *(line: 21)*. Then pass the vector through a feed-forward network. Reshape the final output to a `4x4`

matrix and apply a softmax *(line 29, 30)*.

# Can you explain CNN architecture?

This task is completely different from a normal classification task. In normal classification task network focuses more on the central region of the image. But in the case of Jigsaw, the edge information is much more important than the central one.

So my CNN architecture is different from the usual one in the following ways.

## Padding

I am using some extra **padding** around the image before passing it through CNN *(line: 3)*. And also padding the feature map before each convolution operation (

) to protect as much edge info as possible.*padding = same*

## MaxPooling

I am avoiding **pooling** layers and using just one MaxPool layer to reduce the feature map size *(line: 7)*. Pooling makes the Network *translation invariance, *which* *means even if you rotate of jiggle the object in the image, the network would still detect it. Which is good for any object classification task.

But here we don’t want the network to be *translation invariance*. Our network should be sensitive to variance. Since our edge information is very sensitive.

## Shallow Network

We know that top layers in **CNN** extract feature like edges, corners, etc. And as we go deep, layers tend to extract features like shape, color distribution, etc. Which are not much relevant for our case, so creating a shallow network will help here.

These all are the important details you need to know about CNN architecture. The rest of the network is fairly simple having 3 feed-forward layers, a reshape layer, and finally a softmax layer.

# What is the training process?

Finally, I compile my model with `sparse_categorical_crossentropy`

loss and `adam`

optimizer. Our target would be a 4 unit vector telling the correct position of each piece.

`Target Vector: [[3],[0],[1],[2]]`

I trained the network for 5 epochs. I started with learning rate `0.001`

and batch size `64`

. After each epoch, I am reducing the learning rate and increasing batch size.

# How are the results?

While prediction, our network outputs a 4x4 vector, then we select the index having a maximum value in each row, which is nothing but the predicted position. Thus we get a vector of length 4. Using this vector we can also re-arrange the puzzle pieces and visualize them.

After training, I ran the model on 2K unseen puzzles, and the model was able to solve the 80% puzzle correctly. Which is quite fair.

Here are the few samples solved by the network.

Following is the complete project hosted on GitHub.

If you enjoyed this article, then you should also check out the following article.

I will keep posting more such exciting projects in the future. You can also **join**** **my mailing list to get my latest content directly in your inbox. Thanks for reading!