Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a fast experiment investigating how DeepSeek-R1 carries out on agentic jobs, regardless of not supporting tool use natively, and pl.velo.wiki I was rather amazed by initial results. This DeepSeek-R1 in a single-agent setup, where the model not just plans the actions but also develops the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outshines Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% right, and other designs by an even larger margin:

The experiment followed model use standards from the DeepSeek-R1 paper and the model card: Don't use few-shot examples, avoid adding a system timely, and set the temperature level to 0.5 - 0.7 (0.6 was utilized). You can find additional assessment details here.

Approach

DeepSeek-R1's strong coding capabilities enable it to serve as an agent without being explicitly trained for tool usage. By permitting the design to generate actions as Python code, it can flexibly connect with environments through code execution.

Tools are executed as Python code that is included straight in the prompt. This can be a simple function meaning or a module of a bigger bundle - any legitimate Python code. The design then generates code actions that call these tools.

Results from performing these actions feed back to the model as follow-up messages, driving the next actions up until a last answer is reached. The agent structure is an easy iterative coding loop that mediates the discussion in between the design and its environment.

Conversations

DeepSeek-R1 is utilized as chat design in my experiment, where the design autonomously pulls additional context from its environment by using tools e.g. by utilizing an online search engine or fetching information from web pages. This drives the conversation with the environment that continues till a final response is reached.

In contrast, o1 models are known to carry out improperly when used as chat models i.e. they do not try to pull context throughout a discussion. According to the linked short article, o1 models carry out best when they have the complete context available, with clear directions on what to do with it.

Initially, I likewise attempted a full context in a single prompt technique at each action (with arise from previous actions consisted of), but this caused considerably lower scores on the GAIA subset. Switching to the conversational technique explained above, I was able to reach the reported 65.6% efficiency.

This raises an intriguing question about the claim that o1 isn't a chat model - possibly this observation was more appropriate to older o1 models that did not have tool use capabilities? After all, isn't tool use support an important system for allowing models to pull additional context from their environment? This conversational technique certainly appears effective for DeepSeek-R1, though I still need to carry out comparable try outs o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool usage by means of code actions works so well. This ability to generalize to agentic tasks advises of current research study by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't investigated in that work.

Despite its capability to generalize to tool usage, DeepSeek-R1 often produces long reasoning traces at each action, compared to other models in my experiments, restricting the usefulness of this model in a single-agent setup. Even simpler tasks in some cases take a long time to finish. Further RL on agentic tool use, be it through code actions or not, might be one alternative to enhance efficiency.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning design often switches in between various reasoning ideas without sufficiently exploring promising courses to reach a proper option. This was a major reason for overly long reasoning traces produced by DeepSeek-R1. This can be seen in the recorded traces that are available for download.

Future experiments

Another typical application of thinking models is to utilize them for planning just, while utilizing other designs for creating code actions. This could be a possible new feature of freeact, wiki.die-karte-bitte.de if this separation of functions proves helpful for more complex tasks.

I'm likewise curious about how thinking models that already support tool use (like o1, o3, ...) carry out in a single-agent setup, freechat.mytakeonit.org with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise utilizes code actions, wiki-tb-service.com look interesting.