Open Source Python

Knowlyr Gym

RL Training Framework

Updated 2026-02-26
Gymnasium-style RL training framework: MDP formalization for modeling agent decision processes, Process Reward Models (PRM) for step-level feedback signals, plus sandbox isolation, trajectory recording, and distributed pipeline orchestration.
MDP Modeling · Process Reward Model · Sandbox Isolation

Quick Start

Install
pip install knowlyr-hub[all]
Usage
from trajectoryhub import Pipeline, PipelineConfig

pipeline = Pipeline(PipelineConfig(output_dir="./output"))
traj = pipeline.run_from_log("agent.jsonl", "openhands")

Tools

create_sandbox: Create a Docker sandbox execution environment
execute_tool: Execute a tool inside the sandbox (file_read, file_write, shell, search, git)
reset_sandbox: Reset the sandbox to its initial state
replay_trajectory: Replay an agent execution trajectory inside the sandbox
sandbox_snapshot: Save a snapshot of the current sandbox state (filesystem diff + environment info)
convert_logs: Convert agent logs into the standardized trajectory format
validate_logs: Validate that a log file matches a given agent framework's format
get_schema: Return the JSON Schema definition of the standardized trajectory
recorder_diff: Compare two trajectories across step count, tool usage, success rate, and other dimensions
score_trajectory: Compute a process-level reward for a single agent trajectory
build_preferences: Build preference pairs from multiple trajectories (for RLHF/DPO training)
calibrate_reward: Calibrate automatic rewards against human annotations
list_rubrics: List the available evaluation rubric dimensions
reward_leaderboard: Generate a reward leaderboard from multiple trajectories, sorted by reward score, to compare models and policies
run_pipeline: Run the full agent trajectory data pipeline (Task -> Sandbox -> Recorder -> Reward -> Export)
export_dataset: Export trajectory data to a training format (SFT / DPO / Benchmark / HuggingFace)
process_log: Parse and score a single agent log file into a standardized trajectory
process_logs_batch: Parse and score a directory of agent logs into standardized trajectories
pipeline_status: Check pipeline execution status and progress
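The idea behind build_preferences can be sketched in a few lines. This is an illustrative stand-in, not the library's actual implementation: for each task, the highest- and lowest-reward trajectories are paired as (chosen, rejected) for DPO-style training.

```python
def build_preference_pairs(trajectories):
    """Pair best and worst trajectories per task as (chosen, rejected).

    Each trajectory is a dict with at least "task" and "reward" keys.
    Illustrative sketch only; the real build_preferences may differ.
    """
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t["task"], []).append(t)

    pairs = []
    for task, ts in by_task.items():
        if len(ts) < 2:
            continue  # need at least two trajectories to form a pair
        ts.sort(key=lambda t: t["reward"], reverse=True)
        pairs.append({"task": task, "chosen": ts[0], "rejected": ts[-1]})
    return pairs
```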

Documentation

knowlyr-gym

Gymnasium-Style Reinforcement Learning Framework for LLM Agent Training

MDP Formalization · Three-Layer Process Reward Model · SFT / DPO / GRPO Policy Optimization

Formalized MDP environments, three-layer process reward, and complete policy optimization pipeline

Quick Start · Architecture · Key Innovations · Components · Ecosystem

What is knowlyr-gym?

knowlyr-gym is a training infrastructure for LLM Agents — not another inference framework. It answers three fundamental questions: where to train (Gymnasium-compatible environments), how to evaluate (three-layer Process Reward Model), and how to optimize (SFT / DPO / GRPO policy training). Environments produce trajectories, rewards assess quality, and trainers optimize policy — all connected through standardized data formats into a closed loop.

The framework formalizes LLM tool-use agent tasks as Markov Decision Processes $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ and implements the complete reinforcement learning pipeline from environment interaction to policy optimization.
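The MDP tuple can be made concrete with a minimal sketch. The TimeStep fields mirror those in the architecture diagram below; discounted_return computes the return $G = \sum_t \gamma^t r_t$ over a trajectory's step-level rewards. Names here are illustrative, not the library's API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TimeStep:
    """One environment transition, Gymnasium-style."""
    observation: Any
    reward: float
    terminated: bool   # episode ended by the task itself
    truncated: bool    # episode cut off externally (e.g. step limit)

def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```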

Architecture

graph LR
    subgraph MDP["MDP Environment Layer"]
        ENV["AgentEnv<br/>reset() / step() / close()"]
        TS["TimeStep<br/>observation · reward<br/>terminated · truncated"]
        ENV --> TS
    end

    subgraph RL["RL Training Loop"]
        PI["Policy π<br/>(LLM Agent)"]
        COL["Rollout<br/>collect()"]
        RM["Process Reward<br/>Model (PRM)"]
        EXP["Dataset<br/>SFT / DPO / GRPO"]
        OPT["Policy<br/>Optimization"]
    end

    PI -->|action| ENV
    TS -->|observation| PI
    COL -->|trajectories| RM
    RM -->|scored trajectories| EXP
    EXP --> OPT
    OPT -->|updated π| PI
    ENV -.->|wrappers| COL

    style MDP fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style RL fill:#0d1b2a,color:#e0e0e0,stroke:#444
    style PI fill:#0969da,color:#fff,stroke:#0969da
    style RM fill:#8b5cf6,color:#fff,stroke:#8b5cf6

Key Innovations

Gymnasium-Compatible Environment Protocol

5 registered environments (knowlyr/sandbox, knowlyr/conversation, knowlyr/engineering, knowlyr/advisory, knowlyr/discussion) with 4 composable wrappers — extending the Gymnasium reset() / step() / close() pattern to LLM Agent scenarios with structured tool-call actions and natural language state spaces.
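The composable-wrapper pattern can be sketched with a toy environment. EchoEnv and MaxStepsWrapper below are illustrative stand-ins, not the framework's actual classes; they show how a wrapper layers behavior (here, a step limit that sets the truncated flag) over an env's reset() / step() / close() protocol.

```python
class EchoEnv:
    """Toy stand-in for an AgentEnv: the observation echoes the last action."""
    def reset(self, task=""):
        return {"observation": task, "reward": 0.0,
                "terminated": False, "truncated": False}
    def step(self, action):
        return {"observation": action, "reward": 0.0,
                "terminated": action == "done", "truncated": False}
    def close(self):
        pass

class MaxStepsWrapper:
    """Gymnasium-style wrapper: truncate the episode after max_steps."""
    def __init__(self, env, max_steps):
        self.env, self.max_steps, self._t = env, max_steps, 0
    def reset(self, **kwargs):
        self._t = 0
        return self.env.reset(**kwargs)
    def step(self, action):
        self._t += 1
        ts = self.env.step(action)
        if self._t >= self.max_steps:
            ts["truncated"] = True  # cut off externally, not by the task
        return ts
    def close(self):
        self.env.close()
```

Because the wrapper exposes the same protocol as the env it wraps, wrappers compose freely, which is what makes the four built-in wrappers stackable.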

DomainProfile — Domain-Agnostic Abstraction

Declarative domain configuration covering toolsets, tool categories, outcome rules, and scoring dimension weights. 7 built-in domains (coding, browser, conversation, engineering, advisory, discussion, generic) — add new domains without modifying core code.
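A declarative profile of this kind might look as follows. This is a hypothetical sketch of the shape of such a config (field names are assumptions, not the library's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class DomainProfile:
    """Illustrative sketch of a declarative domain profile."""
    name: str
    tools: list                 # toolset available in this domain
    outcome_rule: str           # how episode success is judged
    dimension_weights: dict = field(default_factory=dict)  # scoring weights

# A hypothetical "coding" domain: success means the test suite passes.
coding = DomainProfile(
    name="coding",
    tools=["file_read", "file_write", "shell", "git"],
    outcome_rule="tests_pass",
    dimension_weights={"correctness": 0.7, "efficiency": 0.3},
)
```

Adding a new domain is then a matter of declaring another profile instance, with no change to core code.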

Three-Layer Process Reward Model

Step-level process reward $r_t = R(s_t, a_t)$ instead of sparse outcome reward. Three layers progressively improve evaluation quality:

Layer | Method | Cost | Latency
Rule-based | Redundancy, regression, info utilization, efficiency | ~0 | <1 ms
LLM-as-Judge | Rubric-based multi-dimensional semantic scoring | ~$0.01/step | ~1 s
Human | Calibration via human annotations | Offline | Offline
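The cheap rule-based layer can be sketched as follows. This is an illustrative example of the kind of rule it applies (the step schema, reward magnitudes, and function name are assumptions, not the library's implementation):

```python
def rule_based_step_reward(step, history):
    """Layer-1 rule-based process reward for one step.

    Penalizes an exact repeat of a prior tool call (redundancy) and
    rewards steps whose observation contributed new information
    (info utilization). Illustrative sketch only.
    """
    reward = 0.0
    call = (step["tool"], step["args"])
    if call in [(h["tool"], h["args"]) for h in history]:
        reward -= 0.5   # redundancy penalty: same tool, same arguments
    if step.get("observation_new_info", False):
        reward += 0.3   # info-utilization bonus
    return reward
```

Because rules like this cost essentially nothing, they can score every step of every trajectory, while the LLM-as-Judge layer is reserved for semantic dimensions rules cannot capture.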

Policy Optimization — SFT / DPO / GRPO

Three methods spanning the full spectrum from behavioral cloning to online policy optimization, plus 6 agent-specific training enhancements: observation masking, step-weighted loss, trajectory chunking, curriculum learning, multi-turn formatting, and step-level GRPO.
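Two of these enhancements, observation masking and step-weighted loss, combine naturally: the loss is computed only on the agent's action tokens (never on environment observations), and each action token is weighted by its step's process reward. The sketch below is illustrative; the token-role labels and function name are assumptions, not the trainer's actual API.

```python
def step_weighted_mask(roles, step_rewards, step_ids):
    """Per-token loss weights combining observation masking with
    step-weighted loss.

    roles[i]        -- "action" (agent output) or "obs" (environment output)
    step_ids[i]     -- which trajectory step token i belongs to
    step_rewards[s] -- process reward of step s

    Observation tokens get weight 0 (masked out); action tokens get
    their step's reward as the loss weight. Illustrative sketch only.
    """
    return [step_rewards[s] if r == "action" else 0.0
            for r, s in zip(roles, step_ids)]
```

The resulting vector multiplies the per-token cross-entropy loss, so high-reward steps dominate the gradient and environment text contributes nothing.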

Quick Start

pip install knowlyr-hub[all]

from knowlyrcore import make

env = make("knowlyr/conversation")
ts = env.reset(task="Help user check order status")
while not ts.done:
    action = my_agent(ts.observation)   # π(a|s)
    ts = env.step(action)               # s', r, done
env.close()

Components

Package | RL Role | Description
knowlyr-core | MDP Protocol | AgentEnv · TimeStep · EnvWrapper · Registry · DomainProfile
knowlyr-sandbox | Environment | Docker sandbox execution · SandboxEnv · ConversationEnv
knowlyr-recorder | Trajectory Buffer | Agent log parsing · standardized trajectories · adapter registry
knowlyr-reward | Reward Model | Three-layer PRM · Rubric scoring · preference pair construction
knowlyr-hub | Rollout & Data | collect() sampling · DatasetExporter · Pipeline orchestration · CAS dedup · GDI ranking
knowlyr-trainer | Policy Optimization | SFT · DPO · GRPO · evaluation · inference bridge

Ecosystem

Layer | Project | Description
Discovery | AI Dataset Radar | Dataset competitive intelligence and trend analysis
Analysis | DataRecipe | Reverse engineering, schema extraction, cost estimation
Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation
Quality | DataCheck | Rule validation, dedup detection, distribution analysis
Audit | ModelAudit | Distillation detection, model fingerprinting
Deliberation | Crew | Adversarial multi-agent deliberation · persistent memory evolution
Identity | knowlyr-id | Identity system + AI employee runtime
Agent Training | knowlyr-gym | Gymnasium-style RL framework for LLM agent training

Want to discuss this project? Reach out to:

Kai, Founder & CEO
赵云帆 (Zhao Yunfan), AI Backend Engineer