import os
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
# PyTorch Geometric for ModelNet10 dataset and aggregators
import torch_geometric.transforms as T
from torch_geometric.datasets import ModelNet
from torch_geometric.nn import aggr
# Set random seed for reproducibility
seed = 123
torch.manual_seed(seed)
Introduction
Most classic neural network architectures, like the Multilayer Perceptron (MLP), are designed for a very specific kind of input: a fixed-size vector. If you have an input vector \((x_1, \ldots, x_N)^\top\), the network learns a specific weight for each position. The first element is treated differently from the second, and so on. This is perfect for structured data like a row in a spreadsheet. But what if your data is a set?
Imagine you want to classify a 3D object represented as a point cloud \(\vb{X}\), i.e., a collection of thousands of coordinates \(\vb{r}\equiv (x, y, z)^\top\):
\[\vb{X} = (\vb{r}_1, \ldots, \vb{r}_M)^\top = \pmqty{x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots \\ x_M & y_M & z_M}.\]
Does the order in which you list these points matter? Of course not. The object is the same regardless of which point you start with. However, an MLP generally produces a different output for \((\vb{r}_1, \ldots, \vb{r}_M)^\top\) than for the reversed input \((\vb{r}_M, \ldots, \vb{r}_1)^\top\), even though both describe the same object. This is a fundamental mismatch between our model’s assumption and our data’s reality. We need a model that is inherently indifferent to the order of its inputs. This property has a name: permutation invariance.
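To see this mismatch in code, here is a minimal sketch (the five-point cloud and the layer sizes are arbitrary choices for illustration, reusing the imports from the top of this post): an MLP fed the flattened coordinates gives different answers for different orderings of the same points.

points = torch.randn(5, 3)                    # a tiny "point cloud" of 5 points
shuffled = points[torch.randperm(5)]          # the same set, listed in another order

mlp = nn.Sequential(nn.Linear(15, 8), nn.ReLU(), nn.Linear(8, 1))
print(mlp(points.flatten()).item())           # one output for this ordering...
print(mlp(shuffled.flatten()).item())         # ...and a different one for the same set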
Formalizing the Challenge
Let’s put this into more formal terms. A typical machine learning problem involves learning a function \(f\) that maps an input domain \(\mathcal{X}\) to an output range \(\mathcal{Y}\). For standard vectors, \(\mathcal{X}\) is a space like \(\R^d\).
When our inputs are sets, the domain changes. A set, \(X\), is an unordered collection of distinct elements, \(\{\vb{x}_1, \ldots, \vb{x}_M\}\), where each individual element, \(\vb{x}_i\) (\(i=1, \ldots, M\)), comes from what is called a universe \(\mathcal{U}\). In the case of a point cloud, the universe \(\mathcal{U}\) would be \(\R^3\). The key insight is that, in sets, order does not matter. For example, \(\{\vb{x}_1, \vb{x}_2, \vb{x}_3\} = \{\vb{x}_3, \vb{x}_1, \vb{x}_2\} = \{\vb{x}_2, \vb{x}_3, \vb{x}_1\}\). Additionally, the size of the set, \(M\), can vary from one input to another. Therefore, the input domain for our function \(f\) is the set of all possible subsets of \(\mathcal{U}\), including the empty set, \(\emptyset\), and \(\mathcal{U}\) itself. This is known in Mathematics as the power set of \(\mathcal{U}\), often denoted as \(2^{\mathcal{U}}\). For example, if \(\mathcal{U} = \{a, b\}\), then \(2^{\mathcal{U}} = \{\emptyset, \{a\}, \{b\}, \{a,b\}\}\).
Now we can formalize our intuition. Our goal is to learn a function \(f: 2^{\mathcal{U}} \rightarrow \mathcal{Y}\) that is permutation invariant. This means that for any set \(X = \{\vb{x}_1, \ldots, \vb{x}_M\}\) and any permutation \(\pi\) (which is just a reordering of the indices \(1, \ldots, M\)), the following must hold:
\[f\pqty{\pqty{\vb{x}_{\pi(1)}, \ldots, \vb{x}_{\pi(M)}}^\top} = f\pqty{\pqty{\vb{x}_1, \ldots, \vb{x}_M}^\top}.\]
If we represent \(X\) as a matrix \(\vb{X}\equiv (\vb{x}_1,\ldots, \vb{x}_M)^\top\) and \(\vb*{\Pi}\) is the permutation matrix associated with \(\pi\), then we can write this condition as
\[f(\vb*{\Pi}\vb{X}) = f(\vb{X}).\]
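This condition is easy to check numerically for a simple candidate \(f\). In the sketch below (names and sizes are arbitrary), \(f\) sums the rows of \(\vb{X}\), and \(\vb*{\Pi}\) is built by permuting the rows of an identity matrix:

X = torch.randn(4, 3)                   # M = 4 points in R^3
Pi = torch.eye(4)[torch.randperm(4)]    # permutation matrix for a random pi

def f(X):
    return X.sum(dim=0)                 # a simple permutation-invariant candidate

print(torch.allclose(f(Pi @ X), f(X)))  # True: f(Pi X) == f(X)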
But here’s the million-dollar question: What do permutation-invariant functions actually look like? Can we characterize their structure mathematically? Can we design a neural network architecture that guarantees this property by its very structure? This is where the Deep Sets paper (Zaheer et al. 2017) made its groundbreaking contribution.
Deep Sets
The core theoretical result of Deep Sets is beautifully simple yet profound:
A function \(f\) operating on a finite set \(X\) is permutation invariant if and only if it can be decomposed as:
\[f(X) = \rho\left(\sum_{\vb{x} \in X} \phi(\vb{x})\right)\]
for suitable functions \(\phi\) and \(\rho\).
It is quite intuitive to see why this condition is sufficient. First, \(\phi: \mathcal{U} \rightarrow \mathbb{R}^d\) maps each element independently to a representation, so it is unaffected by permutations of the set elements. Second, the summation \(\sum_{\vb{x} \in X}\), which aggregates these representations, is commutative, i.e., permutation-invariant. Other commutative aggregators, generically denoted \(\bigoplus_{\vb{x}\in X}\), such as the mean or the maximum, also work and are used in practice. Finally, \(\rho: \mathbb{R}^d \rightarrow \mathcal{Y}\), which maps the aggregated representation to the final output, is likewise unaffected by permutations of the set elements, since its input, the aggregated representation, is already permutation-invariant.
The beauty of this result lies in its universality: every permutation-invariant function can be written this way, and every function of this form is permutation-invariant. However, it is important to note that this is an informal summary of the original paper’s results, which should be interpreted with caution when \(\mathcal{U}\) is uncountable and \(M\) is not fixed (see (Zaheer et al. 2017) for more details).
Nevertheless, this result gives us a valuable blueprint. Since neural networks are universal function approximators, it is possible to build a deep learning model to approximate a permutation-invariant function \(f\) by simply replacing \(\phi\) and \(\rho\) with neural networks. And that’s it! Moreover, the resulting model can handle sets of variable sizes, because the aggregation operation works just as well for 10 elements as it does for 10,000.
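As a minimal sketch of this recipe (a toy model for illustration, not the one we train later), we can take \(\phi\) and \(\rho\) to be small networks joined by a sum, and observe that one and the same model handles sets of any size:

phi = nn.Sequential(nn.Linear(3, 16), nn.ReLU())  # per-element embedding
rho = nn.Sequential(nn.Linear(16, 1))             # readout of the pooled vector

def deep_sets(X):
    return rho(phi(X).sum(dim=0))                 # rho( sum_x phi(x) )

print(deep_sets(torch.randn(10, 3)).shape)        # works for 10 points...
print(deep_sets(torch.randn(10_000, 3)).shape)    # ...and for 10,000 points

X = torch.randn(10, 3)
print(torch.allclose(deep_sets(X), deep_sets(X[torch.randperm(10)])))  # True: invariant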
Additional results are given in (Zaheer et al. 2017) for permutation equivariant functions, i.e., functions for which permuting the inputs results in the same permutation of the outputs. This can be useful in cases where we do not want a single output for the entire set—we want an output for each element that is informed by the entire set context. However, for simplicity, we will focus on permutation-invariant functions in this tutorial.
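Although we will not pursue equivariance further, it is easy to verify numerically for the simplest case, an element-wise map (an arbitrary linear layer in this sketch): permuting the inputs and then applying the map gives the same result as applying the map and then permuting the outputs.

layer = nn.Linear(3, 4)       # applied independently to each element (row)
X = torch.randn(6, 3)
perm = torch.randperm(6)
print(torch.allclose(layer(X[perm]), layer(X)[perm]))  # True: equivariance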
Connection to the Geometric Deep Learning Blueprint
Now, let’s zoom out. Is Deep Sets just a one-off trick, or is it part of a bigger picture? This is where Geometric Deep Learning (GDL) comes in. The GDL “Blueprint” (Bronstein et al. 2021) provides a unified framework for understanding architectures like CNNs, GNNs, and Transformers. It sees them as networks that respect the geometry and symmetries of their input domains. According to the GDL blueprint, for a deep learning architecture to effectively process elements of a domain \(\Omega\) under a symmetry described by a certain group \(\mathfrak{G}\), it must be properly constructed using the following key building blocks:
- Linear \(\mathfrak{G}\)-equivariant layer \(B\): A layer satisfying \(B(g\cdot x) = g\cdot B(x),\;\forall g\in\mathfrak{G}\). If you transform the input, the output transforms in the same way.
- Nonlinearity \(\sigma\): An activation function applied element-wise.
- Local pooling (coarsening) \(P\): An operator that reduces the resolution of the domain, such that the new domain is a compact version of the original.
- \(\mathfrak{G}\)-invariant layer (global pooling) \(A\): A layer satisfying \(A(g\cdot x) = A(x),\;\forall g\in\mathfrak{G}\). They produce an output that is insensitive to the domain’s symmetries.
How does Deep Sets fit into this? Perfectly. In this case, the input domain consists of sets, and the corresponding symmetries are permutations of their elements (the permutation group). Let’s look at the Deep Sets architecture through this lens:
- The action of the \(\phi\) network can be seen as a combination of permutation-equivariant layers and nonlinearities, resulting in a permutation-equivariant map \(\vb{X} \mapsto (\phi(\vb{x}_1), \ldots, \phi(\vb{x}_M))^\top\). Since \(\phi\) is applied element-wise, if you permute the input set elements \((\vb{x}_1, \ldots, \vb{x}_M)^\top\), the output embeddings from \(\phi\) are simply \((\phi(\vb{x}_1), \ldots, \phi(\vb{x}_M))^\top\) in the same permuted order.
- The summation, \(\sum\), is a global pooling operation that creates a permutation-invariant representation. No matter how you permute the embeddings before the sum, the result is identical. Moreover, since the input for the \(\rho\) network is already permutation-invariant, the composition \(\rho \,\circ\, \sum\) can be seen as a permutation-invariant layer.
As a result, the entire architecture \(f = \rho \,\circ\, \sum\; \circ\, \phi\) is a permutation-invariant function.
This reveals something beautiful: Deep Sets is the GDL blueprint applied to the simplest non-Euclidean domain—a set. A Convolutional Neural Network (CNN) is the same blueprint applied to a grid, respecting translational symmetry. A Graph Neural Network (GNN) applies it to a graph, respecting permutation symmetry of the nodes. Deep Sets is, in essence, a GNN on a graph with no edges! This unified perspective helps us understand why these architectures work so well. They are not just collections of layers that perform well empirically; they are principled constructions that correctly embed the fundamental symmetries of the data they are designed for.
Applications
The applications of Deep Sets span numerous domains:
- Point Cloud Processing: A point cloud is a set of low-dimensional vectors. This type of data is frequently encountered in applications such as robotics, computer vision, and cosmology. For example, in computer vision and robotics, 3D point clouds are inherently unordered. Deep Sets architectures, like PointNet (Qi et al. 2016), revolutionized 3D object classification and segmentation.
- Set Expansion: This task begins with a small “seed” set of items that share a common characteristic. The objective is to automatically discover other items from a larger collection that also belong to this implicit category. The model must first infer the underlying concept that links the items in the query set and then use this understanding to retrieve relevant new members. It is an important task due to a wide range of potential applications including personalized information retrieval, computational advertisement, and tagging large amounts of unlabeled or weakly labeled datasets.
- Population Statistics: Estimating properties of distributions from samples—the order of samples shouldn’t matter, making this a natural Deep Sets application.
- Estimating cosmological parameters: In cosmology, a critical task is determining the redshift of a galaxy from photometric data, which indicates its age and distance. One way to estimate the redshift from photometric observations is to apply a regression model (Connolly et al. 1995) to galaxy clusters. A galaxy cluster is a natural example of a set, as its properties are unaffected by the order of its member galaxies. Consequently, Deep Sets can be applied to process the entire cluster as a single input, leveraging collective information to produce accurate redshift estimates for each galaxy.
- Anomaly Detection: The goal is to find the outlier in a set (like identifying the one different face in a group), which requires permutation-equivariant processing.
Code Example with PyTorch
In this section, we will walk through a practical implementation of the Deep Sets architecture using PyTorch. We’ll demonstrate how to build a permutation-invariant neural network for classifying 3D objects represented as point clouds, using the ModelNet10 dataset (Zhirong Wu et al. 2015) as a real-world example. The code will cover all steps—from data preprocessing and model definition to training, evaluation, and visualization—providing a hands-on guide to applying Deep Sets to unordered set data.
Dependencies
We begin by importing the necessary libraries for our implementation. This includes standard Python libraries for numerical computation and visualization, as well as PyTorch and PyTorch Geometric for deep learning and point cloud data handling. Setting a random seed ensures reproducibility of our results.
Model
Now, we define the Deep Sets model as a PyTorch module. The architecture follows the theoretical blueprint: each element of the set is independently embedded by a neural network, the embeddings are aggregated using a permutation-invariant operation (such as sum, mean, or max), and the result is passed through another neural network to produce the final output. The aggregator can be chosen to suit the task, and the model is flexible to different input and output dimensions.
class DeepSets(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, aggregator='sum', dropout=0.0):
        super().__init__()
        # Element-wise embedding network (the phi of the Deep Sets decomposition)
        self.psi = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        # Permutation-invariant aggregation over the set dimension
        if aggregator == 'max':
            self.aggregator = aggr.MaxAggregation()
        elif aggregator == 'mean':
            self.aggregator = aggr.MeanAggregation()
        elif aggregator == 'sum':
            self.aggregator = aggr.SumAggregation()
        else:
            raise ValueError(f"Unknown aggregator: {aggregator}")
        # Readout network (the rho of the decomposition)
        self.phi = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        h = self.psi(x)                           # embed each point independently
        h = self.aggregator(h, dim=1).squeeze(1)  # pool over the points of each set
        y = self.phi(h)                           # map pooled features to logits
        return y
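Before moving on, a quick sanity check (not part of the training pipeline; the sizes below are arbitrary) confirms that the module behaves as the theory promises: shuffling the points within each cloud leaves the logits unchanged.

check_model = DeepSets(input_dim=3, hidden_dim=8, output_dim=2, aggregator='max')
check_model.eval()                      # disable dropout for an exact comparison

x = torch.randn(2, 50, 3)               # batch of 2 clouds with 50 points each
x_perm = x[:, torch.randperm(50), :]    # same clouds, points shuffled

with torch.no_grad():
    print(torch.allclose(check_model(x), check_model(x_perm)))  # True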
Data
Next, we prepare the data for our experiments. We use the ModelNet10 dataset (Zhirong Wu et al. 2015), which contains 3D CAD models from ten categories. Each model is represented as a point cloud. We define a preprocessing pipeline to sample a fixed number of points from each mesh, apply random rotations for data augmentation, and normalize the scale of each point cloud. The dataset is then loaded and split into training and test sets.
N_POINTS = 1024
NUM_CLASSES = 10
DATA_DIR = './data/ModelNet10'
CLASS_NAMES = [
    'Bathtub', 'Bed', 'Chair', 'Desk', 'Dresser',
    'Monitor', 'Night stand', 'Sofa', 'Table', 'Toilet'
]

pre_transform = T.Compose([
    T.SamplePoints(num=N_POINTS, remove_faces=True, include_normals=False),
    T.RandomRotate(180, axis=2),
    T.NormalizeScale(),
])

train_dataset = ModelNet(
    root=DATA_DIR, name='10', train=True, pre_transform=pre_transform
)
test_dataset = ModelNet(
    root=DATA_DIR, name='10', train=False, pre_transform=pre_transform
)
print(f"Training dataset size: {len(train_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")
Training dataset size: 3991
Test dataset size: 908
To feed the data into our Deep Sets model, we need to convert the PyTorch Geometric dataset objects into tensors suitable for batch processing. The following function extracts the point coordinates and labels from each sample and stacks them into tensors, resulting in arrays of shape [num_samples, num_points, 3] for the data and [num_samples] for the labels.
def get_coords_and_labels(dataset):
    X_list = []
    y_list = []
    for i in range(len(dataset)):
        data = dataset[i]
        X_list.append(data.pos.unsqueeze(0))  # Add batch dimension
        y_list.append(data.y)
    X = torch.cat(X_list, dim=0).float()
    y = torch.cat(y_list, dim=0).long()
    return X, y
Once the helper function is defined, we apply it to both the training and test datasets to obtain the tensors that will be used as input and target labels for the model. We also print their shapes to verify that the data has been correctly processed.
X_train, y_train = get_coords_and_labels(train_dataset)
X_test, y_test = get_coords_and_labels(test_dataset)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
X_train shape: torch.Size([3991, 1024, 3]), y_train shape: torch.Size([3991])
X_test shape: torch.Size([908, 1024, 3]), y_test shape: torch.Size([908])
Now, we visualize a few examples from the processed test set. This step helps us confirm that the point clouds have been correctly sampled and preprocessed, and provides an intuitive sense of the data the model will learn from.
def plot_modelnet_clouds(samples_per_class, X, y_true, y_pred=None):
    cmap = plt.get_cmap('tab10')
    fig, axes = plt.subplots(
        NUM_CLASSES,
        samples_per_class,
        figsize=(1.5 * samples_per_class, 1.5 * NUM_CLASSES),
        subplot_kw={'projection': '3d'}
    )
    for i in range(NUM_CLASSES):
        sample_indices = np.where(y_true == i)[0][:samples_per_class]
        for j, idx in enumerate(sample_indices):
            axes[i, j].scatter(
                X[idx, :, 0], X[idx, :, 1], X[idx, :, 2], s=5, color=cmap(i)
            )
            title = CLASS_NAMES[y_true[idx]]
            if y_pred is not None:
                title = "Actual: " + title + f"\nPrediction: {CLASS_NAMES[y_pred[idx]]}"
            axes[i, j].set_title(title, fontsize=8)
            axes[i, j].set_xticks([])
            axes[i, j].set_yticks([])
            axes[i, j].set_zticks([])
            axes[i, j].view_init(elev=30, azim=45)
    plt.tight_layout()
    plt.show()

# Visualize a few examples from the test set
plot_modelnet_clouds(5, X_test, y_test)
Training
With the data ready, we proceed to set up the model, loss function, optimizer, and device. We specify the input, hidden, and output dimensions, choose the aggregation method, and move both the model and data to the appropriate device (CPU or GPU) for efficient computation.
INPUT_DIM = 3  # x, y, z coordinates
HIDDEN_DIM = 32
OUTPUT_DIM = NUM_CLASSES
AGGREGATOR = 'max'
DROPOUT = 0.1
LEARNING_RATE = 1e-3

# Initialize the model
model = DeepSets(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, AGGREGATOR, DROPOUT)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Move data and model to GPU once if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)
X_train = X_train.to(device)
y_train = y_train.to(device)
X_test = X_test.to(device)
y_test = y_test.to(device)
Using device: cuda
To efficiently train the model, we use a DataLoader to create mini-batches from the training data. This allows for faster and more stable optimization, especially when working with large datasets.
BATCH_SIZE = 32

train_dataset = TensorDataset(X_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
Now, we define a function to perform a single training step on a batch of data. This function handles the forward pass, loss computation, backpropagation, and parameter update for each mini-batch.
def train_step(model, optimizer, criterion, batch_X, batch_y):
    model.train()
    optimizer.zero_grad()
    logits = model(batch_X)
    loss = criterion(logits, batch_y)
    loss.backward()
    optimizer.step()
For evaluation, we create a function that computes the model’s predictions and accuracy on a given dataset. This function can also return the loss if a criterion is provided, making it useful for monitoring both training and validation performance.
def evaluate_model(model, X, y, criterion=None):
    model.eval()
    with torch.no_grad():
        logits = model(X)
        loss = criterion(logits, y).item() if criterion else None
        probs = torch.softmax(logits, dim=1)
        predicted_classes = torch.argmax(probs, dim=1)
        accuracy = (predicted_classes == y).sum().item() / len(y)
    return accuracy, predicted_classes, loss
The following function is used to print training and test metrics at regular intervals during training. This helps track the model’s progress and diagnose potential issues such as overfitting or underfitting.
def print_logs(epoch, history, period=10):
    if (epoch + 1) % period == 0:
        logs = []
        for subset, subhistory in history.items():
            for metric, values in subhistory.items():
                value = f'{values[-1]:.4f}'
                key = ' '.join([subset, metric]).capitalize()
                logs.append(' = '.join([key, value]))
        print(f'Epoch {epoch + 1}', *logs, sep=', ')
We are now ready to train the Deep Sets model. The following loop iterates over the specified number of epochs, performing training on mini-batches and evaluating the model on both the training and test sets after each epoch. The results are stored for later visualization.
EPOCHS = 100
history = {s: {'loss': [], 'accuracy': []} for s in ('train', 'test')}

for epoch in range(EPOCHS):
    for batch_X, batch_y in train_dataloader:
        train_step(model, optimizer, criterion, batch_X, batch_y)

    train_accuracy, _, train_loss = evaluate_model(model, X_train, y_train, criterion)
    history['train']['accuracy'].append(train_accuracy)
    history['train']['loss'].append(train_loss)

    test_accuracy, _, test_loss = evaluate_model(model, X_test, y_test, criterion)
    history['test']['accuracy'].append(test_accuracy)
    history['test']['loss'].append(test_loss)

    print_logs(epoch, history)
Epoch 10, Train loss = 0.5509, Train accuracy = 0.7913, Test loss = 0.8021, Test accuracy = 0.6729
Epoch 20, Train loss = 0.4022, Train accuracy = 0.8554, Test loss = 0.6051, Test accuracy = 0.7930
Epoch 30, Train loss = 0.3544, Train accuracy = 0.8752, Test loss = 0.5518, Test accuracy = 0.8194
Epoch 40, Train loss = 0.3121, Train accuracy = 0.8905, Test loss = 0.4944, Test accuracy = 0.8392
Epoch 50, Train loss = 0.2573, Train accuracy = 0.9093, Test loss = 0.4279, Test accuracy = 0.8645
Epoch 60, Train loss = 0.3168, Train accuracy = 0.8865, Test loss = 0.5256, Test accuracy = 0.8315
Epoch 70, Train loss = 0.2547, Train accuracy = 0.9108, Test loss = 0.4438, Test accuracy = 0.8623
Epoch 80, Train loss = 0.2790, Train accuracy = 0.8983, Test loss = 0.5038, Test accuracy = 0.8271
Epoch 90, Train loss = 0.2552, Train accuracy = 0.9078, Test loss = 0.4747, Test accuracy = 0.8392
Epoch 100, Train loss = 0.2716, Train accuracy = 0.9000, Test loss = 0.5057, Test accuracy = 0.8326
Results
After training, we visualize the evolution of the loss and accuracy for both the training and test sets. This provides insight into the learning dynamics and helps assess whether the model is generalizing well.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for subset, subset_history in history.items():
    for i, (key, value) in enumerate(subset_history.items()):
        axes[i].plot(value, label=subset.capitalize())
        axes[i].set_xlabel('Epoch')
        axes[i].set_ylabel(key.capitalize())
        axes[i].legend()
plt.tight_layout()
plt.show()
On the left, the loss curves for both the training and test sets exhibit a rapid initial decrease. This is typical as the model quickly learns to fit the most salient patterns in the data. After this initial phase, the training loss continues to decrease and eventually plateaus at a low value, indicating that the model is able to fit the training data well. The test loss follows a similar trajectory, though it stabilizes at a slightly higher value than the training loss.
On the right, the accuracy curves tell a complementary story. Both training and test accuracy increase sharply during the early epochs, reflecting the model’s ability to quickly capture the underlying structure of the data. The training accuracy eventually saturates at a high value, suggesting that the model capacity is sufficient for the task. The test accuracy also reaches a high level, but remains consistently below the training accuracy. This gap between training and test performance suggests a modest degree of overfitting. However, the gap is not excessive, and the test accuracy remains high and stable throughout the later epochs, indicating that the model is not suffering from severe overfitting.
Both loss and accuracy curves are relatively smooth, without large erratic jumps or oscillations, indicating that the optimization process is stable and the learning rate is well-chosen. Overall, the training dynamics suggest that the Deep Sets model learns a robust representation of the point cloud data, achieving strong performance on both the training and test sets. The small but persistent gap between training and test accuracy is a reminder that further improvements could be achieved through regularization, data augmentation, or hyperparameter tuning, but the current results already demonstrate the effectiveness of the Deep Sets architecture for permutation-invariant learning on unordered sets.
Next, we visualize the model’s predictions on the test set. By comparing the predicted and actual classes for a selection of point clouds, we can qualitatively assess the model’s performance and identify any systematic errors.
# Get final predictions on the test set
_, y_test_pred, _ = evaluate_model(model, X_test, y_test)

# Visualize predictions (move tensors to CPU for plotting)
plot_modelnet_clouds(5, X_test.cpu(), y_test.cpu(), y_test_pred.cpu())
The Deep Sets model correctly classifies most examples, especially for visually distinct categories like “Chair” and “Monitor.” Misclassifications tend to occur between geometrically similar objects, such as “Desk” and “Table” or “Sofa” and “Bed,” which is expected given the overlap in their shapes. Overall, the model demonstrates strong generalization, with errors largely confined to ambiguous cases where class boundaries are inherently fuzzy in the point cloud representation.
To quantitatively assess the model’s performance, we visualize the confusion matrix, which reveals patterns in the model’s successes and misclassifications, highlighting which classes are most easily confused. This diagnostic step helps us understand the strengths and weaknesses of our Deep Sets model in a more granular way.
def plot_confusion_matrix(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, normalize='true')
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        cm, annot=True, fmt='.2f', cmap='magma',
        xticklabels=CLASS_NAMES, yticklabels=CLASS_NAMES
    )
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

# Example usage after evaluation:
# _, y_test_pred, _ = evaluate_model(model, X_test, y_test)
plot_confusion_matrix(y_test.cpu().numpy(), y_test_pred.cpu().numpy())
Ideally, we would expect to see strong diagonal dominance—high values along the diagonal and near-zero elsewhere—indicating that most samples are correctly classified. This is largely the case for several classes, such as “Bed,” “Chair,” “Monitor,” and “Sofa,” which all exhibit high true positive rates (values close to 1.0 on the diagonal), reflecting the model’s ability to reliably distinguish these categories.
However, the matrix also reveals some notable patterns of confusion. For instance, “Bathtub” is frequently misclassified as “Bed,” as indicated by the relatively high off-diagonal value in the first row and second column. Similarly, “Desk” and “Table” are often confused with each other, which is not surprising given their geometric similarities in point cloud representation. These off-diagonal entries highlight the model’s difficulty in distinguishing between certain classes that share similar shapes or features. One possible reason for this could be class imbalance.
Examining the distribution of classes in the training set may provide valuable context for interpreting the previous results. An imbalanced dataset can bias the model towards more frequent classes, so visualizing the class distribution allows us to assess whether such imbalances might be influencing our outcomes.
def plot_class_distribution(y, class_names):
    counts = np.bincount(y, minlength=len(class_names))
    cmap = plt.get_cmap('tab10')
    colors = [cmap(i) for i in range(len(class_names))]
    plt.figure(figsize=(8, 4))
    plt.bar(class_names, counts, color=colors)
    plt.ylabel('Number of Instances')
    plt.xlabel('Class')
    plt.title('Class Distribution in Training Set')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

plot_class_distribution(y_train.cpu().numpy(), CLASS_NAMES)
The bar chart above illustrates the distribution of training samples across the ten ModelNet10 categories. It is immediately apparent that the dataset is not perfectly balanced: some classes, such as “Chair,” “Sofa,” and “Monitor,” are represented by a substantially larger number of instances, while others—most notably “Bathtub”—have far fewer examples. This imbalance is important to recognize, as it can bias the model during training, making it more likely to favor the majority classes and potentially leading to poorer generalization on underrepresented categories.
For example, the relatively small number of “Bathtub” samples may help explain the model’s difficulty in correctly classifying this class, as seen in the confusion matrix. Conversely, the abundance of “Chair” and “Sofa” instances likely contributes to the model’s strong performance on these categories. Such imbalances are common in real-world datasets and highlight the importance of considering class distribution when interpreting results and designing further experiments. Techniques such as data augmentation, class weighting, or resampling could be explored to mitigate these effects and improve performance on minority classes.
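As a sketch of one such mitigation (not used in the experiments above), inverse-frequency class weights can be computed from the training labels and passed to the loss; this particular weighting scheme is just one common choice:

# Hypothetical mitigation: weight the loss inversely to class frequency
class_counts = torch.bincount(y_train.cpu(), minlength=NUM_CLASSES).float()
class_weights = class_counts.sum() / (NUM_CLASSES * class_counts)  # inverse frequency
weighted_criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))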
Conclusion
Deep Sets provides a simple, powerful, and theoretically grounded method for applying deep learning to unordered data. By starting from the fundamental requirement of permutation invariance, we can derive an architecture that is not only effective in practice—on tasks ranging from point cloud analysis to particle physics—but is also a cornerstone of the broader field of Geometric Deep Learning. It reminds us that understanding the structure and symmetry of our data is one of the most powerful tools we have for building better models.