
Custom rl loss patch 1 #3

Open

mzio wants to merge 2 commits into thinking-machines-lab:main from mzio:custom_rl_loss_patch_1

Conversation


@mzio mzio commented Oct 10, 2025

See #2 (comment)

Main issue: At a high level, there seems to be a conflict between how a user would specify an RL loss (and the Datum loss_fn_inputs it requires) and how this gets processed in training_client, where supervised-learning loss_fn_inputs are expected.
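
For concreteness, a rough illustration of the mismatch (illustrative dicts only; advantages and weights are the keys named in this thread, the values are made up):

# What an RL user naturally provides per Datum:
rl_loss_fn_inputs = {"advantages": [0.0, 0.0, 1.7, 1.7]}
# What the built-in "cross_entropy" path expects per Datum:
sl_loss_fn_inputs = {"weights": [0.0, 0.0, 1.0, 1.0]}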

@capture_exceptions(fatal=True)
async def forward_backward_custom_async(
    self, data: List[types.Datum], loss_fn: CustomLossFnV1
) -> APIFuture[types.ForwardBackwardOutput]:
    import torch
    # First do a forward pass and get logprobs
    forward_future = await self.forward_async(data, "cross_entropy")
    forward_result = await forward_future.result_async()
    logprobs_list: List[torch.Tensor] = []
    for out in forward_result.loss_fn_outputs:
        logprob = torch.tensor(out["logprobs"].data).clone().detach().requires_grad_(True)
        logprobs_list.append(logprob)
    # Now apply user-provided function
    loss, metrics = loss_fn(data, logprobs_list)

The solution here is to instead create a separate copy of the Datum list, where each element now has a weights key in loss_fn_inputs. We infer 0 vs. 1 based on whether the advantages in the original datum list are zero or not (maybe a bad heuristic). A sketch of this conversion helper is below.
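
A minimal sketch of what that conversion helper could look like. The deepcopy and the in-place loss_fn_inputs update are assumptions about how Datum can be copied and mutated; only the weights/advantages keys and the zero-vs-nonzero heuristic come from the description above.

import copy
import torch

def convert_to_cross_entropy_datum(datum):
    # Work on a copy so the original RL datum (with its advantages) stays untouched.
    new_datum = copy.deepcopy(datum)
    adv = new_datum.loss_fn_inputs["advantages"]
    # Unwrap a tensor-data wrapper if present (the forward output above exposes `.data`).
    adv = torch.as_tensor(getattr(adv, "data", adv), dtype=torch.float32)
    # Heuristic from above: weight = 1 wherever the advantage is nonzero, 0 elsewhere.
    new_datum.loss_fn_inputs["weights"] = (adv != 0).to(adv.dtype)
    return new_datum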

We can then use the copy to compute logprobs as we do currently, while applying the user's custom loss_fn to the original data, e.g.:

# convert
_data_for_xent = list(map(convert_to_cross_entropy_datum, data))

# get on-policy logprobs
forward_future = await self.forward_async(_data_for_xent, "cross_entropy")
forward_result = await forward_future.result_async()
logprobs_list: List[torch.Tensor] = []
for out in forward_result.loss_fn_outputs:
    logprob = torch.tensor(out["logprobs"].data).clone().detach().requires_grad_(True)
    logprobs_list.append(logprob)

# apply user-provided function (on original data list)
loss, metrics = loss_fn(data, logprobs_list)
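
For completeness, a hypothetical user-provided loss_fn matching the (data, logprobs_list) -> (loss, metrics) signature used above: a simple advantage-weighted (REINFORCE-style) objective. The advantages lookup mirrors the heuristic discussed earlier and is an assumption about where they live in loss_fn_inputs.

import torch

def reinforce_loss(data, logprobs_list):
    per_datum_losses = []
    for datum, logprobs in zip(data, logprobs_list):
        adv = datum.loss_fn_inputs["advantages"]
        # Unwrap a tensor-data wrapper if present, as with logprobs above.
        adv = torch.as_tensor(getattr(adv, "data", adv), dtype=logprobs.dtype)
        # Negative advantage-weighted log-likelihood for this sequence.
        per_datum_losses.append(-(adv * logprobs).sum())
    loss = torch.stack(per_datum_losses).mean()
    metrics = {"loss": loss.item()}
    return loss, metrics

# Passed in as the custom loss:
# loss, metrics = reinforce_loss(data, logprobs_list)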
