Possible VRAM Control Issue #234

@alan-he-494165

Description

I was running the following code to optimise 500 Zn crystal structures:

# Imports added for completeness; the autobatching helpers are assumed to live in
# torch_sim.autobatching (consistent with the traceback below).
import torch
import torch_sim as ts
from torch_sim.autobatching import (
    InFlightAutoBatcher,
    calculate_memory_scaler,
    estimate_max_memory_scaler,
)

# compound_init, mace_model, and device are set up earlier in the notebook.
state = ts.initialize_state(compound_init, device=device, dtype=torch.float64)
state_list = state.split()

# Estimate the memory metric per structure and the largest total metric the model can handle.
memory_metric_values = [
    calculate_memory_scaler(s, memory_scales_with="n_atoms_x_density") for s in state_list
]
max_memory_metric = estimate_max_memory_scaler(
    mace_model, state_list, metric_values=memory_metric_values
)
print("Max memory metric", max_memory_metric)

batcher = InFlightAutoBatcher(
    mace_model,
    max_memory_padding=1,
    max_memory_scaler=max_memory_metric * 0.8,  # 20% safety margin
)

convergence_fn = ts.generate_force_convergence_fn(0.025)
relaxed_state = ts.optimize(
    system=state,
    model=mace_model,
    optimizer=ts.frechet_cell_fire,
    autobatcher=batcher,
    max_steps=1000,
    convergence_fn=convergence_fn,
)

However, the following error was observed:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 22
     15 batcher = InFlightAutoBatcher(
     16     mace_model,
     17     max_memory_padding=1,
     18     max_memory_scaler=max_memory_metric * 0.8
     19 )
     21 convergence_fn = ts.generate_force_convergence_fn(0.025)
---> 22 relaxed_state = ts.optimize(
     23     system=state,
     24     model=mace_model,
     25     optimizer=ts.frechet_cell_fire,
     26     autobatcher=batcher,
     27     max_steps=1000,
     28     convergence_fn=convergence_fn,
     29 )
     30 # extract the final energy from the trajectory file
     31 print(relaxed_state.energy)

File ~/miniconda3/envs/X/lib/python3.11/site-packages/torch_sim/runners.py:439, in optimize(system, model, optimizer, convergence_fn, trajectory_reporter, autobatcher, max_steps, steps_between_swaps, pbar, **optimizer_kwargs)
    436     pbar_kwargs.setdefault("disable", None)
    437     tqdm_pbar = tqdm(total=state.n_batches, **pbar_kwargs)
--> 439 while (result := autobatcher.next_batch(state, convergence_tensor))[0] is not None:
    440     state, converged_states, batch_indices = result
    441     all_converged_states.extend(converged_states)

File ~/miniconda3/envs/X/lib/python3.11/site-packages/torch_sim/autobatching.py:1062, in InFlightAutoBatcher.next_batch(self, updated_state, convergence_tensor)
   1060 if updated_state.n_batches > 0:
   1061     next_states = [updated_state, *next_states]
-> 1062 next_batch = concatenate_states(next_states)
   1064 if self.return_indices:
   1065     return next_batch, completed_states, self.current_idx

File ~/miniconda3/envs/X/lib/python3.11/site-packages/torch_sim/state.py:839, in concatenate_states(states, device)
    836 # Concatenate collected tensors
    837 for prop, tensors in per_atom_tensors.items():
    838     # if tensors:
--> 839     concatenated[prop] = torch.cat(tensors, dim=0)
    841 for prop, tensors in per_batch_tensors.items():
    842     # if tensors:
    843     concatenated[prop] = torch.cat(tensors, dim=0)

TypeError: expected Tensor as element 1 in argument 0, but got NoneType

This sort of TypeError seems to show up when CUDA memory is heavily used, and watching nvidia-smi during the run supported VRAM pressure as the likely cause.
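
For reference, a minimal way to check this from inside the process is to log the allocator state around the optimize call (a sketch; report_vram is just a hypothetical helper built on standard torch.cuda calls):

import torch

def report_vram(tag: str) -> None:
    # allocated = memory held by live tensors; reserved = what the caching
    # allocator has claimed from the driver (roughly what nvidia-smi shows).
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")

report_vram("before optimize")
# ... ts.optimize(...) ...
report_vram("after optimize")

If the reserved figure is close to the card's capacity right before the crash, that would support the memory-pressure explanation.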

In my own tests, once I understood the autobatching and wrote the batching and optimisation cycles out manually (as introduced in the tutorial), adding VRAM-cleaning commands between cycles and running the optimisation in a subprocess where needed, the problem went away at this calculation scale (see the sketch below). Interestingly, convergence also becomes easier for the smaller-scale calculations.
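
For concreteness, a sketch of the shape of that workaround (not the tutorial code verbatim): the 500 structures are processed in chunks, each chunk is optimised without the autobatcher, and the CUDA caching allocator is emptied between chunks. The chunk_size value, the gc/empty_cache calls, and collecting results in relaxed_chunks are my own choices; state_list, mace_model, and convergence_fn come from the snippet above, and concatenate_states is imported from torch_sim.state as seen in the traceback.

import gc

import torch
import torch_sim as ts
from torch_sim.state import concatenate_states

chunk_size = 50  # hypothetical; pick something that comfortably fits in VRAM
relaxed_chunks = []

for start in range(0, len(state_list), chunk_size):
    chunk = concatenate_states(state_list[start : start + chunk_size])
    relaxed = ts.optimize(
        system=chunk,
        model=mace_model,
        optimizer=ts.frechet_cell_fire,
        max_steps=1000,
        convergence_fn=convergence_fn,
    )
    relaxed_chunks.append(relaxed)
    # Drop live references and return cached blocks to the driver
    # before the next chunk is loaded.
    del chunk
    gc.collect()
    torch.cuda.empty_cache()

The subprocess variant goes one step further: each chunk is optimised in its own Python process, so all VRAM is returned to the driver when that process exits.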

Labels: bug (Something isn't working), geo-opt (Geometry optimization)
