PSEC

Abstract

Humans excel at reusing prior knowledge to address new challenges and developing skills while solving problems. This paradigm becomes increasingly popular in the development of autonomous agents, as it develops systems that can self-evolve in response to new challenges like human beings. However, previous methods suffer from limited training efficiency when expanding new skills and fail to fully leverage prior knowledge to facilitate new task learning. In this paper, we propose Parametric Skill Expansion and Composition (PSEC), a new framework designed to iteratively evolve the agents' capabilities and efficiently address new challenges by maintaining a manageable skill library. This library can progressively integrate skill primitives as plug-and-play Low-Rank Adaptation (LoRA) modules in parameter-efficient finetuning, facilitating efficient and flexible skill expansion. This structure also enables the direct skill compositions in parameter space by merging LoRA modules that encode different skills, leveraging shared information across skills to effectively program new skills. Based on this, we propose a context-aware module to dynamically activate different skills to collaboratively handle new tasks. Empowering diverse applications including multi-objective composition, dynamics shift, and continual policy shift, the results on D4RL, DSRL benchmarks, and the DeepMind Control Suite show that PSEC exhibits superior capacity to leverage prior knowledge to efficiently tackle new challenges, as well as expand its skill libraries to evolve the capabilities.

Problem Setups

We propose PSEC, a generic framework that can efficiently reuse prior knowledge and self-evolve to address new tasks.

We consider a Markov Decision Process (MDP) where \( s \in \mathcal{S} \) and \( a \in \mathcal{A} \) are the state and action spaces, respectively. The transition dynamics are defined as \( \mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S}) \), and the reward function is \( r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R} \). We assume that the state space \( \mathcal{S} \) and action space \( \mathcal{A} \) remain unchanged during training, which is a mild assumption in many relevant works. We consider an agent with \( \pi_0 \) as its initial policy, which is progressively tasked with new tasks \( \mathcal{T}_i \) for \( i = 1, 2, \dots \). These tasks may differ in their rewards \( r \) or dynamics \( \mathcal{P} \), reflecting real-world scenarios where non-stationary dynamics or new challenges continually emerge. Each task is provided with several expert demonstrations \( \mathcal{D}^{\mathcal{T}_i}_{e} := \{(s, a)\} \) or a mixed-quality dataset with reward labels \( \mathcal{D}^{\mathcal{T}_i}_{o} := \{(s, a, r_i, s')\} \). Consequently, we can use either offline reinforcement learning (RL) or imitation learning (IL) to adapt to these new challenges. Inspired by previous works, we maintain a policy library \( \Pi \) to store the policies associated with different tasks, as shown in Figure 1(a). Our goal is to utilize prior knowledge to enable efficient policy learning and gradually expand \( \Pi \) to incorporate new abilities throughout the training process.

\[ \Pi = \{\pi_0, \pi_1, \pi_2, \pi_3, \dots\}. \]

Efficient Expansion: How to manage the skill library \(\Pi\) to learn new skills in an efficient and manageable way;
Efficient Composition: How to fully utilize the prior knowledge from primitives in the skill set \(\Pi\) to tackle the emerging challenges.

Efficient Policy Expansion via Low-Rank Adaptation

We consider a pretrained policy \( \pi_0 \) and denote \( \mathit{W}_0 \in \mathbb{R}^{d_{\rm in} \times d_{\rm out}} \) as its associated weight matrix. Directly finetuning \( W_0 \) to adapt to new skills might be extremely inefficient. Instead, we introduce a tune-able LoRA module \( \Delta W \) upon \( W_0 \), i.e., \( \mathit{W}_0 + \Delta \mathit{W} = \mathit{W}_0 + BA \), to perform the adaptation while keeping \( \mathit{W}_0 \) frozen. Here, \( B \in \mathbb{R}^{d_{\rm in} \times n} \), \( A \in \mathbb{R}^{n \times d_{\rm out}} \), and \( n \ll \min(d_{\rm in}, d_{\rm out}) \). Specifically, the input feature of the linear layer is denoted as \( h_{\rm in} \in \mathbb{R}^{d_{\rm in}} \), and the output feature of the linear layer is \( h_{\rm out} \in \mathbb{R}^{d_{\rm out}} \). The final output of a LoRA-augmented layer can be calculated through the following forward process:

\[ h_{\rm out} = (\mathit{W}_0 + \alpha \Delta \mathit{W}) h_{\rm in} = (\mathit{W}_0 + \alpha BA) h_{\rm in} = \mathit{W}_0 h_{\rm in} + \alpha BA h_{\rm in}. \]

where \( \alpha \) is a weight to balance the pre-trained model and LoRA modules. This operation naturally prevents catastrophic forgetting in a parameter isolation approach, and the low-rank decomposition structure of \( A \) and \( B \) significantly reduces the computational burden. Benefiting from this lightweight characteristic, we can manage numerous LoRA modules \( \{\Delta \mathit{W}_i = B_i A_i \mid i \in 1, 2, \dots, k\} \) to encode different skill primitives \( \pi_i \), respectively, as shown in Figure 2. This flexible approach allows us to easily integrate new skills based on existing knowledge, while also facilitating library management by removing suboptimal primitives and retaining the effective ones. More importantly, by adjusting the value of \( \alpha \), it holds the potential to interpolate the pretrained skill in \( W_0 \) and other primitives in \( \Delta W_i \) to generate novel skills, as shown in Figure 3.

\[ W = W_0 + \sum_{i=1}^{k} \alpha_i \Delta W_i = W_0 + \sum_{i=1}^{k} \alpha_i B_i A_i, \]

where \( \alpha_i \) is the weight to interpolate pre-trained weights and LoRA modules. This interpolation property has been explored in fields like text-to-image generation and language modeling, but its application in decision-making scenarios remains highly underexplored, despite LoRA's proven efficacy in skill acquisition. Next, we will elaborate on how to effectively combine LoRA modules to adapt to decision-making applications.

Figure 2: Learning new skills using LoRA modules.

Figure 3: Interpolation in LoRA modules.

Context-aware Composition in Parameter Space

We propose a simple yet effective context-aware composition method that adaptively leverages pretrained knowledge to optimally address the encountering tasks according to the agent's current context. Specifically, we introduce a context-aware modular \( \alpha(s;\theta) \in \mathbb{R}^k \) with \( \alpha_i \) as its \( i \)-th dimension. The composition method can be expressed by the following equation:

\[ \begin{split} W({\theta}) = W_0 + \sum_{i=1}^{k} \alpha_i(s;\theta) \Delta W_i = W_0 + \sum_{i=1}^{k} \alpha_i(s;\theta) B_i A_i. \end{split} \]

Here, \( \alpha(s;\theta) \) adaptively adjusts output weights based on the agent's current situation \( s \), with the parameter \( \theta \) optimized via minimizing the diffusion loss.

Comparison between parameter-, noise-, and action-level composition, as shown in Figure 4.

Figure 4: Parameter-level composition offers more flexibility to leverage the shared or complementary structure across skills to compose new skills. Noise- and action-level composition, however, is too late to benefit from this information.
To evaluate the advantages of parameter-level composition over other levels of composition, we employ t-SNE to project the output features of LoRA modules into a 2D space, alongside the noise and generated actions of various skills, as shown in Figure 5.

Figure 5: t-SNE projections of samples from different skills in parameter, noise, and action space. The parameter space exhibits a good structure for skill composition, where skills share common knowledge while retaining their unique features to avoid confusion. Noise and action spaces are either too noisy to clearly distinguish between skills or fail to capture the shared structure across them.

To achieve effective skill sharing, two primitive paradigms are introduced, including Hard Parameter Sharing and Soft Parameter Sharing, we compare the architecture of PSEC with other multitask learning frameworks, as shown in Figure 6.

Figure 6: Illustrative comparisons between PSEC and other modularized multitask learning frameworks when deployed to continual learning settings.

PSEC enjoys remarkable versatility across various scenarios since many problems can be resolved by reusing pre-trained policies and gradually evolving its capabilities during training. Thus, we present a comprehensive evaluation across diverse scenarios, including multi-objective composition, policy learning under policy shifts and dynamics shifts, to answer the following questions:

Can the context-aware modular effectively compose different skills?
Can our parameter-level composition outperform noise- and action-level compositions?
Can the introduction of LoRA modules enhance training and sample efficiency?
Can PSEC framework iteratively evolve after incorporating more skills?

Multi-objective Composition

In many real-world applications, a complex task can be decomposed into simpler objectives, where collaboratively combining these atomic skills can tackle the complex task. In this setting, we aim to evaluate the advantages of parameter-level composition over other levels of composition in Figure 4, and the effectiveness of the context-aware modular. We consider one practical multi-objective composition scenario within the safe offline RL domain. This setting requires solving a constrained Markov Decision Process (MDP) to tackle a complex trilogy objective: avoiding distributional shift, maximizing rewards, and meanwhile minimizing costs. These objectives can conflict, thus requiring a nuanced composition to optimize performance effectively.

SOTA Performance on DSRL

Table 1: Normalized DSRL benchmark results. Costs below 1 indicate safety.
Results are averaged over 20 evaluation episodes and 4 seeds.
Bold: Safe agents with costs below 1. Blue: Safe agents achieving the highest reward.
\( \uparrow \): the higher the better. \( \downarrow \): the lower the better.

Visualization for Context-aware Modular

Figure 7: Output weights of context-aware modular evaluated on the MetaDrive-easymean task. The network dynamically adjusts the weights to handle real-time demands: It prioritizes safety policies when the vehicle approaches obstacles or navigates a turn while avoiding boundary lines. When no obstacles are present and the task is simply driving straight, it shifts focus toward maximizing rewards and maintain some safety insurance.

Continual Policy Shift Setting

Figure 8: Continual evolution on DeepMind Control Suite for Continual policy shift.

We evaluate another practical scenario where the agent is progressively tasked with new tasks. We aim to continuously expand the skill libraries to test if the capabilities of agents to learn new skills can be gradually enhanced as prior knowledge grows and test the efficiency of LoRA.

Training and sample efficiency: To demonstrate the training and sample efficiency of PSEC, we conduct extensive evaluations across varying numbers of trajectories and different methods. 9 shows that PSEC achieves superior sample efficiency across different training sample sizes, particularly when data is scarce (e.g., only 10 trajectories). Figure 10 shows that PSEC can quickly attain excellent performance even without composition, highlighting the effectiveness of the LoRA modules. Hence, we train less than 50k gradient steps for almost all tasks, while previous methods typically require millions of gradient steps and data to obtain reasonable results.
Continual Evolution: Table 2 shows that PSEC effectively leverages prior knowledge to facilitate efficient policy learning given solely limited data. Notably, \( \text{S+W} \rightarrow \text{R} \) outperforms \( \text{S} \rightarrow \text{R} \), demonstrating that the learning capability of PSEC gradually evolves as the skill library grows. In contrast, training from scratch or replacing the LoRA modules with MLP fails to learn new skills given limited data, highlighting the effectiveness of both utilizing prior knowledge and the introduction of LoRA to efficiently adapt to new skills and self-evolution. Moreover, note that even PSEC (MLP) outperforms NSEC and ASEC, further highlighting the advantages of parameter-level compositions.
Context-aware Composition v.s. Fixed Composition: We carefully tune the fixed composition (w/o CA) of different skills during composition. However, Figure 11 shows that the context-aware modular can consistently outperform the fixed ones across different levels of compositions. This demonstrates the advantages of the context-aware composition network to fully leverage the prior knowledge in the skill library to enable efficient policy adaptations.

Figure 9: Sample efficiency.

Figure 10: Training efficiency.

Figure 11: Context-aware efficiency.

Table 2: Results in the policy shift setting. \( \text{S} \), \( \text{W} \), and \( \text{R} \) denote stand, walk, and run, respectively. 10 trajectories are provided for \( \text{W} \) and \( \text{R} \) tasks.

Dynamics Shift Setting

Figure 12: The illustration of the source and target domains for the dynamic shift setting.

We evaluate PSEC in another practical setting to further validate its versatility, where the dynamics \( \mathcal{P} \) shift to encompass diverse scenarios such as cross-embodiment, sim-to-real transfer, and policy learning in non-stationary environments.

Figure 13: Results in the dynamics shift setting over 10 episodes and 5 seeds. The suffixes -m, -mr, and -me refer to \( \mathcal{D}_o^{\mathcal{P}_1} \) sampled from medium, medium-replay, and medium-expert-v2 data in D4RL, respectively.

Ablation Study

We primarily ablate on different LoRA ranks \(n\) to assess the robustness of our methods in continual policy shift setting. Figure 14 demonstrates that under varied LoRA \(n\) ranks, PSEC consistently outperforms the MLP variant across various LoRA ranks, demonstrating the superior robustness of LoRA modules. Among the different rank settings, we observe that {\(n=8\)} yields the best results, thus is opted as the default choice for the experiments. We hypothesize using rank larger than 8 degenerates is due to the training data being quite limited (e.g. only 10 demonstrations)

Figure 14: Ablations on LoRA ranks.

BibTeX


        @inproceedings{
        liu2025psec,
        title={Skill Expansion and Composition in Parameter Space},
        author={Tenglong Liu, Jianxiong Li, Yinan Zheng, Haoyi Niu, Yixing Lan, Xin Xu, Xianyuan Zhan},
        booktitle={International Conference on Learning Representations},
        year={2025},
        url={https://openreview.net/forum?id=GLWf2fq0bX}
        }