Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment examining how DeepSeek-R1 performs on agentic tasks, in spite of not supporting tool use natively, sciencewiki.science and I was quite amazed by preliminary outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not just prepares the actions however also creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outshines Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% appropriate, and other models by an even larger margin:

The experiment followed design use standards from the DeepSeek-R1 paper and the model card: Don't use few-shot examples, avoid adding a system prompt, and set the temperature level to 0.5 - 0.7 (0.6 was used). You can discover additional evaluation details here.

Approach

DeepSeek-R1's strong coding abilities allow it to function as a representative without being explicitly trained for tool usage. By enabling the design to produce actions as Python code, it can flexibly engage with environments through code execution.

Tools are carried out as Python code that is consisted of straight in the prompt. This can be a basic function meaning or a module of a larger bundle - any legitimate Python code. The design then produces code actions that call these tools.

Results from performing these actions feed back to the model as follow-up messages, driving the next actions till a final answer is reached. The agent framework is a simple iterative coding loop that mediates the discussion between the design and lespoetesbizarres.free.fr its .

Conversations

DeepSeek-R1 is utilized as chat design in my experiment, where the design autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing a search engine or fetching information from websites. This drives the discussion with the environment that continues up until a last response is reached.

In contrast, o1 designs are known to carry out poorly when used as chat models i.e. they do not try to pull context during a conversation. According to the linked article, o1 models perform best when they have the complete context available, with clear directions on what to do with it.

Initially, I also attempted a full context in a single timely approach at each action (with outcomes from previous steps consisted of), however this caused considerably lower ratings on the GAIA subset. Switching to the conversational approach explained above, I was able to reach the reported 65.6% performance.

This raises an interesting question about the claim that o1 isn't a chat design - perhaps this observation was more pertinent to older o1 designs that did not have tool use abilities? After all, isn't tool use support a crucial system for allowing models to pull additional context from their environment? This conversational technique certainly appears reliable for DeepSeek-R1, fishtanklive.wiki though I still require to perform comparable try outs o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on mathematics and coding jobs, it is impressive that generalization to agentic jobs with tool usage via code actions works so well. This capability to generalize to agentic jobs advises of current research study by DeepMind that reveals that RL generalizes whereas SFT remembers, although generalization to tool use wasn't examined in that work.

Despite its capability to generalize to tool usage, DeepSeek-R1 frequently produces really long reasoning traces at each step, compared to other models in my experiments, restricting the usefulness of this model in a single-agent setup. Even simpler tasks in some cases take a very long time to finish. Further RL on agentic tool usage, be it through code actions or not, could be one choice to improve efficiency.

Underthinking

I also observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning model frequently switches between various thinking thoughts without adequately exploring appealing paths to reach an appropriate solution. This was a significant reason for excessively long thinking traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.

Future experiments

Another common application of thinking models is to utilize them for planning only, while using other models for producing code actions. This might be a prospective brand-new feature of freeact, if this separation of functions shows beneficial for more complex jobs.

I'm likewise curious about how reasoning models that currently support tool usage (like o1, o3, ...) carry out in a single-agent setup, with and without creating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also utilizes code actions, look fascinating.