Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment examining how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite pleased by the preliminary results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:

The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find further evaluation details here.
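For concreteness, here is a minimal sketch of how these settings could be applied with an OpenAI-compatible client. The endpoint, model name, API key, and user message are placeholders, not the exact values used in the experiment.

```python
# Sketch only: an OpenAI-compatible client pointed at a DeepSeek-R1 endpoint.
# Endpoint, model name, and prompt are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    # No system message and no few-shot examples, per the usage recommendations.
    messages=[{"role": "user", "content": "Plan the next step and write it as Python code."}],
    temperature=0.6,  # within the recommended 0.5 - 0.7 range
)
print(response.choices[0].message.content)
```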

Approach

DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By letting the model generate actions as Python code, it can flexibly interact with environments through code execution.

Tools are implemented as Python code that is included in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
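As an illustration, a tool could be as simple as the following function, whose source is included verbatim in the prompt; the model then emits a code action that calls it. The function name and the example call are hypothetical and not part of any actual tool set.

```python
import urllib.request


# Tool definition included in the prompt (hypothetical example).
def fetch_page(url: str, max_chars: int = 2000) -> str:
    """Download a web page and return the first `max_chars` characters of its body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]


# A code action the model might generate, calling the tool above.
text = fetch_page("https://example.com")
print(text)
```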

Results from executing these actions are fed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
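A rough sketch of such a loop is shown below, assuming an OpenAI-compatible chat client. `extract_code` (which pulls a Python block out of the model's reply) and `run_python` (which executes it in a sandbox) are assumed helpers, not actual freeact APIs.

```python
def agent_loop(client, model: str, task: str, max_steps: int = 10) -> str:
    """Iterative coding loop: the model proposes code actions, the environment
    executes them, and the results are fed back as follow-up messages."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model=model, messages=messages, temperature=0.6
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        code = extract_code(reply)   # code action proposed by the model, if any
        if code is None:             # no code action -> treat the reply as the final answer
            return reply

        result = run_python(code)    # execute the code action in a sandbox
        # Feed the execution result back to the model, driving the next step.
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return "No final answer reached within the step budget."
```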

Conversations

DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from websites. This drives the conversation with the environment, which continues until a final answer is reached.

In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't try to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.

Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
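For comparison, this is roughly what the lower-scoring single-prompt variant looked like: all previous code actions and results are concatenated into one user message at each step, instead of being kept as separate conversation turns. The helper and its argument names are illustrative.

```python
def build_single_prompt(task: str, steps: list[tuple[str, str]]) -> list[dict]:
    """Full-context-in-a-single-prompt variant: one user message per step,
    containing the task plus all previous code actions and their results."""
    history = "\n\n".join(
        f"Code:\n{code}\nResult:\n{result}" for code, result in steps
    )
    content = f"{task}\n\nPrevious actions and results:\n{history}\n\nNext action:"
    return [{"role": "user", "content": content}]
```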

This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to conduct similar experiments with o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.

Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, be it via code actions or not, could be one option to improve efficiency.

Underthinking

I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the overly long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.

Future experiments

Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
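A rough sketch of what such a role separation could look like, assuming two OpenAI-compatible clients; this is speculative, not how freeact currently works, and the model names are placeholders.

```python
def plan_and_act(planner_client, coder_client, task: str) -> str:
    """Use a reasoning model for planning only and a second model for the code action."""
    plan = planner_client.chat.completions.create(
        model="reasoning-model",  # placeholder name
        messages=[{"role": "user", "content": f"Outline the steps needed to solve:\n{task}"}],
    ).choices[0].message.content

    code_action = coder_client.chat.completions.create(
        model="code-model",  # placeholder name
        messages=[{"role": "user", "content": f"Write Python code for the first step of this plan:\n{plan}"}],
    ).choices[0].message.content
    return code_action
```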

I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.