8 changes: 7 additions & 1 deletion .gitignore
@@ -66,4 +66,10 @@ venv.bak/
 dmypy.json

 # aim
-*.aim*
+*.aim*
+
+# Datasets
+data/*
+
+# Generated images
+samples/*
Contributor:

I'm not sure about adding such generic folders to the gitignore. Are these only created when running the examples? In that case I'd either leave it up to the developer not to commit these, or put them, e.g., in examples/_data/ and examples/_samples/.

Collaborator Author:

Yes, this directory is only created when running the examples. I wasn’t planning to add it to the .gitignore either, but I included it to get your input during the review. I’ll remove it then — thanks for the feedback!

276 changes: 275 additions & 1 deletion dwave/plugins/torch/models/boltzmann_machine.py
@@ -43,7 +43,7 @@
spread = AggregatedSamples.spread


-__all__ = ["GraphRestrictedBoltzmannMachine"]
+__all__ = ["GraphRestrictedBoltzmannMachine", "RestrictedBoltzmannMachine"]


class GraphRestrictedBoltzmannMachine(torch.nn.Module):
@@ -662,3 +662,277 @@ def estimate_beta(self, spins: torch.Tensor) -> float:
        bqm = BinaryQuadraticModel.from_ising(*self.to_ising(1))
        beta = 1 / mple(bqm, (spins.detach().cpu().numpy(), self._nodes))[0]
        return beta

class RestrictedBoltzmannMachine(torch.nn.Module):
Collaborator (@kevinchern, Jan 5, 2026):

Make RBM extend GraphRestrictedBoltzmannMachine? e.g.,

def __init__(self, n_vis, n_hid):
    bipartite_graph = ...
    super().__init__(nodes=bipartite_graph.nodes, edges=..., hidden_nodes=...)
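
For illustration only, a rough sketch of that suggestion, assuming the GRBM constructor accepts nodes, edges, and hidden_nodes keyword arguments as in the pseudocode above (the constructor signature and the use of networkx here are assumptions, not the agreed design):

import networkx as nx

class BipartiteRBM(GraphRestrictedBoltzmannMachine):
    def __init__(self, n_vis: int, n_hid: int) -> None:
        # Complete bipartite graph: visible nodes 0..n_vis-1, hidden nodes n_vis..n_vis+n_hid-1
        bipartite_graph = nx.complete_bipartite_graph(n_vis, n_hid)
        super().__init__(
            nodes=list(bipartite_graph.nodes),
            edges=list(bipartite_graph.edges),
            hidden_nodes=list(range(n_vis, n_vis + n_hid)),
        )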

Collaborator Author:

As we discussed in one of our ML tools meetings, we decided to keep these two separate to make the RBM as efficient as possible. There is no need to materialize a graph for the RBM. I would be happy to discuss this in a meeting.

Collaborator:

Yes, that was the case early on in development. Per yesterday's discussion, one requirement of this implementation is for it to be a drop-in replacement for other GRBM models.

Collaborator Author:

I don't remember agreeing on this :) Let's discuss this in a meeting.

"""A Restricted Boltzmann Machine (RBM) model.

This class defines the parameterization and inference of a binary RBM.
Collaborator:

We should use the {-1,1} convention instead of the {0,1} convention. I understand that {-1,1} is way more common in D-Wave's code. Am I wrong? @thisac

Contributor:

I don't have a strong opinion here. Happy to hear opinions on this.

Collaborator Author:

Since both Vlad and Kevin suggested this, I will look into it.
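
For reference, if the class were switched to the {-1,+1} convention, the standard change of variables v = (s + 1)/2 maps the {0,1} parameters onto spin parameters. A hedged sketch of that mapping (up to an additive energy constant and the sign convention chosen for the Ising energy), where rbm is an instance of this class:

# Hedged sketch: converting {0,1} RBM parameters (weights W, visible biases a,
# hidden biases b) to the equivalent {-1,+1} parameterization via v = (s + 1) / 2.
J = rbm.weights / 4                                              # spin-spin couplings
h_visible = rbm.visible_biases / 2 + rbm.weights.sum(dim=1) / 4  # visible spin biases
h_hidden = rbm.hidden_biases / 2 + rbm.weights.sum(dim=0) / 4    # hidden spin biases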

    Training is performed using Persistent Contrastive Divergence (PCD).

    Args:
        n_visible (int): Number of visible units.
        n_hidden (int): Number of hidden units.
    """

    def __init__(
        self,
        n_visible: int,
        n_hidden: int,
    ) -> None:
        super().__init__()

        # Model hyperparameters
        self._n_visible = n_visible
        self._n_hidden = n_hidden

        # Initialize model parameters
        # Initialize weights
        self._weights = torch.nn.Parameter(
            0.1 * torch.randn(n_visible, n_hidden)
        )
        # Initialize visible unit biases
        self._visible_biases = torch.nn.Parameter(
            0.5 * torch.ones(n_visible)
        )
        # Initialize hidden unit biases
        self._hidden_biases = torch.nn.Parameter(
            0.5 * torch.ones(n_hidden)
        )
Comment on lines +689 to +701
Contributor:

Where do the 0.1, 0.5 and 0.5 values come from? Should they be hard-coded?

Collaborator:

RE: "Where do the 0.1, 0.5 and 0.5 values come from? Should they be hard-coded?"

I set those values arbitrarily for the GRBM. There should be a better initialization scheme for RBMs, e.g., section 8.

Collaborator Author:

These values worked for the image generation example. Sure, I can experiment with 0.01 as suggested in the guide and will let you know how it affects the performance.

Collaborator:

I've found the initialisation of the GRBM to be not great for my experiments, so I've had to pass initial linear and quadratic weights. @kevinchern in your experience, have you had to do the same? If so, should we change the default initialisation?

Collaborator (@kevinchern, Dec 1, 2025):

@VolodyaCO good point. I've had similar experiences and found setting the initial weights to 0 to be robust in general. Could you create an issue for this?

Edit: actually, I'll do it now.

Edit 2: here's the issue; please add more details as you see fit: #48
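
For reference, the initialization described in that guide (presumably Hinton's practical guide, section 8) would look roughly like the sketch below; data_mean is a hypothetical tensor holding the per-unit mean of the training data, and none of this is the current default:

# Hedged sketch of the guide's recommendation: small Gaussian weights,
# zero hidden biases, and visible biases set to log(p / (1 - p)) where p is
# the empirical mean activation of each visible unit (data_mean is hypothetical).
weights = torch.nn.Parameter(0.01 * torch.randn(n_visible, n_hidden))
hidden_biases = torch.nn.Parameter(torch.zeros(n_hidden))
p = data_mean.clamp(1e-3, 1 - 1e-3)  # avoid log(0)
visible_biases = torch.nn.Parameter(torch.log(p / (1 - p)))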


        # Stores the last visible states to initialize the Markov chain in
        # Persistent Contrastive Divergence (PCD)
        self.register_buffer("_previous_visible_values", None)

        # Initialize momenta tensors for momentum-based updates (all start at 0)
        self.register_buffer("_weight_momenta", torch.zeros(n_visible, n_hidden))
        self.register_buffer("_visible_bias_momenta", torch.zeros(n_visible))
        self.register_buffer("_hidden_bias_momenta", torch.zeros(n_hidden))

    @property
    def n_visible(self) -> int:
        """Number of visible units."""
        return self._n_visible

    @property
    def n_hidden(self) -> int:
        """Number of hidden units."""
        return self._n_hidden

    @property
    def weights(self) -> torch.Tensor:
        """Weights of the RBM."""
        return self._weights

    @property
    def visible_biases(self) -> torch.Tensor:
        """Visible biases of the RBM."""
        return self._visible_biases

    @property
    def hidden_biases(self) -> torch.Tensor:
        """Hidden biases of the RBM."""
        return self._hidden_biases

    @property
    def previous_visible_values(self) -> torch.Tensor:
        """Previous visible values used in Persistent Contrastive Divergence (PCD)."""
        return self._previous_visible_values

    @property
    def weight_momenta(self) -> torch.Tensor:
        """Weight momenta of the RBM."""
        return self._weight_momenta

    @property
    def visible_bias_momenta(self) -> torch.Tensor:
        """Visible bias momenta of the RBM."""
        return self._visible_bias_momenta

    @property
    def hidden_bias_momenta(self) -> torch.Tensor:
        """Hidden bias momenta of the RBM."""
        return self._hidden_bias_momenta
Contributor:

Are all of these useful attributes to access for a user? Consider removing some of these properties if they're only used within the class and not useful for a general user.

Collaborator Author:

Not really. I added them because I thought it would be consistent with the GRBM. Sure, I'll keep only the first two then.



    def _sample_hidden(self, visible: torch.Tensor) -> torch.Tensor:
        """Sample from the distribution P(h|v).

        Args:
            visible (torch.Tensor): Tensor of shape (batch_size, n_visible)
                representing the states of visible units.

        Returns:
            torch.Tensor: Binary tensor of shape (batch_size, n_hidden) representing
            sampled hidden units.
Contributor:

Args should have indented linebreaks, returns should not.

Suggested change
-        Args:
-            visible (torch.Tensor): Tensor of shape (batch_size, n_visible)
-            representing the states of visible units.
-        Returns:
-            torch.Tensor: Binary tensor of shape (batch_size, n_hidden) representing
-                sampled hidden units.
+        Args:
+            visible (torch.Tensor): Tensor of shape (batch_size, n_visible)
+                representing the states of visible units.
+        Returns:
+            torch.Tensor: Binary tensor of shape (batch_size, n_hidden) representing
+            sampled hidden units.
"""
hidden_probs = torch.sigmoid(self._hidden_biases + visible @ self._weights)
return torch.bernoulli(hidden_probs)

def _sample_visible(self, hidden: torch.Tensor) -> torch.Tensor:
"""Sample from the distribution P(v|h).

Args:
hidden (torch.Tensor): Tensor of shape (batch_size, n_hidden)
representing the states of hidden units.
Contributor:

Suggested change
-            hidden (torch.Tensor): Tensor of shape (batch_size, n_hidden)
-            representing the states of hidden units.
+            hidden (torch.Tensor): Tensor of shape (batch_size, n_hidden)
+                representing the states of hidden units.

        Returns:
            torch.Tensor: Binary tensor of shape (batch_size, n_visible) representing
            sampled visible units.
        """
        visible_probs = torch.sigmoid(self._visible_biases + hidden @ self._weights.t())
        return torch.bernoulli(visible_probs)

    def generate_sample(
        self,
        batch_size: int,
        gibbs_steps: int,
        start_visible: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
Contributor:

Should this be named just sample instead, to conform with the GRBM class?

Contributor:

If start_visible != None then batch_size isn't required, right? You could make that optional as well unless there's a reasonable default value to use (e.g., batch_size=1).

Similarly, would it make sense to have gibbs_steps default to 1? I noticed that a test was using

hidden = RBM._sample_hidden()

which could in that case be written as

_, hidden = RBM.generate_sample()

Collaborator Author:

Regarding batch_size, you're right. I'll make it optional.
Regarding gibbs_steps, I'd prefer not to set a default value. I want users to make an explicit choice rather than unknowingly rely on a default of 1 (we often need more steps for our experiments). That test example just shows that using generate_sample with one step is equivalent to a single _sample_hidden call.

"""Generate a sample of visible and hidden units using gibbs sampling.

Args:
batch_size (int): Number of samples to generate.
gibbs_steps (int): Number of Gibbs sampling steps to perform.
start_visible (torch.Tensor | None, optional): Initial visible states to
start the Gibbs chain (shape: [batch_size, n_visible]). If None,
a random Gaussian initialization is used.


Contributor:

Suggested change (removes the extra blank line before Returns):
        Returns:
            tuple[torch.Tensor, torch.Tensor]: A tuple of (visible, hidden) from the last Gibbs step:
                - visible: (batch_size, n_visible)
                - hidden: (batch_size, n_hidden)
        """
        if start_visible is None:
            visible_values = torch.randn(
                batch_size, self.n_visible, device=self._weights.device
            )
        else:
            visible_values = start_visible

        hidden_values = None

        for _ in range(gibbs_steps):
            hidden_values = self._sample_hidden(visible_values)
            visible_values = self._sample_visible(hidden_values)

        return visible_values, hidden_values
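
A minimal usage sketch for this method (sizes and step count are illustrative only):

# Hedged usage sketch: draw 64 samples from an already-trained RBM,
# starting the chain from random noise and running 100 Gibbs sweeps.
rbm = RestrictedBoltzmannMachine(n_visible=784, n_hidden=128)
visible, hidden = rbm.generate_sample(batch_size=64, gibbs_steps=100)
# visible has shape (64, 784), hidden has shape (64, 128); both are binary.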

    def _contrastive_divergence(
        self,
        batch: torch.Tensor,
        epoch: int,
        n_gibbs_steps: int,
        learning_rate: float,
        momentum_coefficient: float,
        weight_decay: float,
        n_epochs: int,
    ) -> torch.Tensor:
Collaborator:

This should be a public method. Also, from the docstrings, it was difficult for me to infer how to use this while training because of epoch and n_epochs (it isn't clear how this information is used: to compute a decayed learning rate). I think it would be a good addition to have an example in the docstring, something like

for epoch in range(n_epochs):
  for batch in dataloader:
    rbm.contrastive_divergence(batch, epoch, n_gibbs_steps, learning_rate, momentum_coefficient, weight_decay, n_epochs)

Collaborator Author (@anahitamansouri, Dec 2, 2025):

You're right, I will make it public. And I guess it's good to add this to the docstring, or what about referring to the example in rbm_image_generation.py, where it's used in a real setting?

Collaborator:

Maybe we can have both: a quick use-it-like-this snippet, and a reference to the image generation example too.

"""
Perform one step of Contrastive Divergence (CD-k) with momentum and weight decay.
Uses Persistent Contrastive Divergence (PCD) by maintaining the last visible states
for Gibbs sampling across batches.
Contributor:

Suggested change
-        """
-        Perform one step of Contrastive Divergence (CD-k) with momentum and weight decay.
-        Uses Persistent Contrastive Divergence (PCD) by maintaining the last visible states
-        for Gibbs sampling across batches.
+        """Perform one step of Contrastive Divergence (CD-k) with momentum and weight decay.
+        Uses Persistent Contrastive Divergence (PCD) by maintaining the last visible states
+        for Gibbs sampling across batches.


        Args:
            batch (torch.Tensor): A batch of input data of shape (batch_size, n_visible).
            epoch (int): Current training epoch.
            n_gibbs_steps (int): Number of Gibbs sampling steps per update.
            learning_rate (float): Base learning rate for parameter updates.
            momentum_coefficient (float): Momentum coefficient for parameter updates.
            weight_decay (float): Weight decay (L2 regularization) coefficient for the weights.
            n_epochs (int): Total number of training epochs, used to linearly decay the learning rate.

        Returns:
            torch.Tensor: The reconstruction error (L1 norm) for the batch.

Contributor:

Suggested change (removes the blank line before the closing docstring quotes):
"""

# Positive phase (data-driven)
hidden_probs = torch.sigmoid(self._hidden_biases + batch @ self._weights)

weight_grads = torch.matmul(batch.t(), hidden_probs)
visible_bias_grads = batch
hidden_bias_grads = hidden_probs

batch_size = batch.size(0)

# Initialize previous visible states for Persistent CD
if self._previous_visible_values == None:
self._previous_visible_values = torch.randn_like(
batch, device=self._weights.device
)

# Negative phase (model-driven)
# Sample from the model using gibbs sampling
visible_values, hidden_values = self.generate_sample(
batch_size, n_gibbs_steps, self._previous_visible_values
)

visible_values = visible_values.detach()
hidden_values = hidden_values.detach()
# Store samples to initialize the next Markov chain with (PCD)
self._previous_visible_values = visible_values

# Compute the gradients for negative phase
weight_grads -= torch.matmul(visible_values.t(), hidden_values)

visible_bias_grads -= visible_values
hidden_bias_grads -= hidden_values

# Average across the batch
weight_grads /= batch_size
visible_bias_grads /= batch_size
hidden_bias_grads /= batch_size

# Compute decayed learning rate
decayed_learning_rate = learning_rate - (learning_rate / n_epochs * epoch)

# Update momenta
self._weight_momenta = self._weight_momenta * momentum_coefficient + decayed_learning_rate * weight_grads
self._visible_bias_momenta = self._visible_bias_momenta * momentum_coefficient + decayed_learning_rate * torch.sum(
visible_bias_grads, dim=0
)
self._hidden_bias_momenta = self._hidden_bias_momenta * momentum_coefficient + decayed_learning_rate * torch.sum(
hidden_bias_grads, dim=0
)

with torch.no_grad():
Collaborator:

Shouldn't all calculations in this method be wrapped in a no grad context?

Collaborator Author:

Well, the only part that needs to be inside torch.no_grad is the parameter update; the rest can also be wrapped in it.

Collaborator:

Doesn't sampling visible from hidden and hidden from visible also trigger gradient tracking?
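
For what it's worth, one way this could be addressed is to disable autograd for the whole update, e.g. by decorating the method; a sketch only, not the agreed resolution:

# Hedged sketch: running the entire CD update without autograd tracking,
# so neither the Gibbs sampling nor the momentum bookkeeping builds a graph.
@torch.no_grad()
def _contrastive_divergence(self, batch, epoch, n_gibbs_steps, learning_rate,
                            momentum_coefficient, weight_decay, n_epochs):
    ...  # body unchanged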

            # Update parameters
            self._weights += self._weight_momenta
            self._visible_biases += self._visible_bias_momenta
            self._hidden_biases += self._hidden_bias_momenta

            # Apply weight decay
            self._weights -= decayed_learning_rate * self._weights * weight_decay

        # Compute reconstruction error (L1 norm)
        reconstruction = self._sample_visible(self._sample_hidden(batch))
        reconstruction = reconstruction.detach()
        error = torch.sum(torch.abs(batch - reconstruction))

        return error
Contributor:

Nitpick, but just error makes it sound like an actual error is being returned, not data.

Suggested change
-        error = torch.sum(torch.abs(batch - reconstruction))
-        return error
+        reconstruction_error = torch.sum(torch.abs(batch - reconstruction))
+        return reconstruction_error
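
Once the method is public (as agreed above), the training loop from the review would read roughly as follows; the dataloader and all hyperparameter values are illustrative, and the public name contrastive_divergence is assumed:

# Hedged training-loop sketch, assuming the method is exposed as
# contrastive_divergence per the review discussion.
rbm = RestrictedBoltzmannMachine(n_visible=784, n_hidden=128)
n_epochs = 20
for epoch in range(n_epochs):
    for batch in dataloader:  # batches of shape (B, 784) with {0,1} entries
        reconstruction_error = rbm.contrastive_divergence(
            batch,
            epoch=epoch,
            n_gibbs_steps=1,
            learning_rate=0.1,
            momentum_coefficient=0.5,
            weight_decay=1e-4,
            n_epochs=n_epochs,
        )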


    def forward(self, visible: torch.Tensor) -> torch.Tensor:
Contributor:

To keep it the same as the GRBM.

Suggested change
-    def forward(self, visible: torch.Tensor) -> torch.Tensor:
+    def forward(self, x: torch.Tensor) -> torch.Tensor:

"""
Computes the RBM free energy of a batch of visible units averaged over the batch.
Contributor:

Suggested change
-        """
-        Computes the RBM free energy of a batch of visible units averaged over the batch.
+        """Computes the RBM free energy of a batch of visible units averaged over the batch.


        The free energy F(visible) of a visible vector is:

            F(visible) = - visible · visible_biases
                         - sum_{j=1}^{n_hidden} log(1 + exp(hidden_biases[j] + (visible · weights)_j))
Contributor:

Suggested change
-        F(visible) = - visible · visible_biases
-        - sum_{j=1}^{n_hidden} log(1 + exp(hidden_biases[j] + (visible · weights)_j))
+        .. math::
+            F(visible) = - visible · visible_biases
+            - sum_{j=1}^{n_hidden} log(1 + exp(hidden_biases[j] + (visible · weights)_j))


        Args:
            visible (torch.Tensor): Tensor of shape (batch_size, n_visible) representing the visible layer.

        Returns:
            torch.Tensor: Scalar tensor representing the **average free energy** over the batch.
        """

        v_term = (visible * self._visible_biases).sum(dim=1)

        hidden_pre_activation = visible @ self._weights + self._hidden_biases

        h_term = torch.sum(torch.nn.functional.softplus(hidden_pre_activation), dim=1)

        free_energy_per_sample = -v_term - h_term

        # Average over the batch
        return free_energy_per_sample.mean()
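
Because forward returns the mean free energy, it can also drive an autograd-based contrastive-divergence loop with a standard optimizer, as an alternative to the manual momentum updates above; a hedged sketch with illustrative values (rbm and dataloader are assumptions):

# Hedged sketch: minimize F(data) - F(model samples) via autograd.
optimizer = torch.optim.SGD(rbm.parameters(), lr=0.05)
for batch in dataloader:  # batches of shape (B, n_visible) with {0,1} entries
    fake_visible, _ = rbm.generate_sample(batch.size(0), gibbs_steps=1, start_visible=batch)
    loss = rbm(batch) - rbm(fake_visible.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()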