Gymnasium-style RL training framework for LLM agents: an MDP formalization of agent decision processes, Process Reward Models (PRMs) that provide step-level feedback, sandbox isolation, trajectory recording, and distributed pipeline orchestration.

MDP Modeling · Process Reward Model · Sandbox Isolation

Quick Start

Install

pip install knowlyr-hub[all]

Usage

from trajectoryhub import Pipeline, PipelineConfig

pipeline = Pipeline(PipelineConfig(output_dir="./output"))
# Process an existing agent log; "openhands" names the agent framework that produced it
traj = pipeline.run_from_log("agent.jsonl", "openhands")

MCP Tools

19 callable endpoints


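create_sandbox: Create a Docker sandbox execution environment

execute_tool: Execute a tool inside the sandbox (file_read, file_write, shell, search, git)

reset_sandbox: Reset the sandbox to its initial state

replay_trajectory: Replay an Agent execution trajectory inside the sandbox

sandbox_snapshot: Save a snapshot of the sandbox's current state (filesystem diff + environment info)

convert_logs: Convert Agent logs into the standardized trajectory format

validate_logs: Check whether a log file matches the format of the specified Agent framework

get_schema: Return the JSON Schema definition of the standardized trajectory

recorder_diff: Compare two trajectories along dimensions such as step count, tool usage, and success rate

score_trajectory: Compute process-level Reward for a single Agent trajectory

build_preferences: Build preference pairs from multiple trajectories (for RLHF/DPO training)

calibrate_reward: Calibrate automatic Reward against human annotations

list_rubrics: List the available evaluation Rubric dimensions

reward_leaderboard: Generate a reward leaderboard from multiple trajectories, sorted by Reward score, to compare different models and policies

run_pipeline: Run the complete Agent trajectory data Pipeline (Task -> Sandbox -> Recorder -> Reward -> Export)

export_dataset: Export trajectory data to the specified training format (SFT / DPO / Benchmark / HuggingFace)

process_log: Process a single Agent log file, parsing and scoring it to produce a standardized trajectory

process_logs_batch: Batch-process a directory of Agent logs, parsing and scoring them to produce standardized trajectories

pipeline_status: Show Pipeline execution status and progress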
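
As a rough illustration of how a client might invoke one of the tools listed above, the sketch below builds a JSON-RPC 2.0 request for the MCP `tools/call` method. The tool name `score_trajectory` comes from the list; the argument name `trajectory_path` and the file path are hypothetical, since the tool's input schema is not documented here.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "trajectory_path" is a hypothetical argument name used for illustration only.
request = build_tool_call(
    1, "score_trajectory", {"trajectory_path": "./output/traj.json"}
)
payload = json.dumps(request)  # string to send over the MCP transport
```

The real argument names for each tool should be read from the server's `tools/list` response rather than taken from this sketch.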

Documentation

knowlyr-gym

Name: Knowlyr Gym
Author: Knowlyr

Gymnasium-Style Reinforcement Learning Framework for LLM Agent Training

MDP Formalization · Three-Layer Process Reward Model · SFT / DPO / GRPO Policy Optimization

Formalized MDP environments, three-layer process reward, and complete policy optimization pipeline

Quick Start · Architecture · Key Innovations · Components · Ecosystem

What is knowlyr-gym?

knowlyr-gym is training infrastructure for LLM Agents, not another inference framework. It answers three fundamental questions: where to train (Gymnasium-compatible environments), how to evaluate (three-layer Process Reward Model), and how to optimize (SFT / DPO / GRPO policy training). Environments produce trajectories, rewards assess their quality, and trainers optimize the policy, with standardized data formats connecting all three into a closed loop.

The framework formalizes LLM tool-use agent tasks as Markov Decision Processes $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ (state space, action space, transition function, reward function, and discount factor) and implements the complete reinforcement learning pipeline from environment interaction to policy optimization.
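
To make the role of the discount factor $\gamma$ concrete, here is a minimal sketch of the standard discounted return $G = \sum_t \gamma^t r_t$ over one episode's step rewards. The function name and the default `gamma` are illustrative assumptions, not part of the knowlyr-gym API.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps with reward 1.0 each and gamma = 0.5: 1 + 0.5 + 0.25
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```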

Architecture

graph LR
    subgraph MDP["MDP Environment Layer"]
        ENV["AgentEnv<br/>reset() / step() / close()"]
        TS["TimeStep<br/>observation · reward<br/>terminated · truncated"]
        ENV --> TS
    end

    subgraph RL["RL Training Loop"]
        PI["Policy π<br/>(LLM Agent)"]
        COL["Rollout<br/>collect()"]
        RM["Process Reward<br/>Model (PRM)"]
        EXP["Dataset<br/>SFT / DPO / GRPO"]
        OPT["Policy<br/>Optimization"]
    end

    PI -->|action| ENV
    TS -->|observation| PI
    COL -->|trajectories| RM
    RM -->|scored trajectories| EXP
    EXP --> OPT
    OPT -->|updated π| PI
    ENV -.->|wrappers| COL

    style MDP fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style RL fill:#0d1b2a,color:#e0e0e0,stroke:#444
    style PI fill:#0969da,color:#fff,stroke:#0969da
    style RM fill:#8b5cf6,color:#fff,stroke:#8b5cf6

Key Innovations

Gymnasium-Compatible Environment Protocol

Five registered environments (knowlyr/sandbox, knowlyr/conversation, knowlyr/engineering, knowlyr/advisory, knowlyr/discussion) and four composable wrappers extend the Gymnasium reset() / step() / close() pattern to LLM Agent scenarios, with structured tool-call actions and natural-language state spaces.
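
The reset() / step() / close() pattern and the idea of a composable wrapper can be sketched in a few lines. Everything below is a self-contained toy, not the knowlyr-core implementation: `EchoEnv`, `MaxStepsWrapper`, the `finish` tool, and the assumption that `done` means "terminated or truncated" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimeStep:
    observation: str
    reward: float = 0.0
    terminated: bool = False  # the task itself ended
    truncated: bool = False   # an external limit cut the episode short

    @property
    def done(self):
        # Assumption: an episode is done when either flag is set.
        return self.terminated or self.truncated


class EchoEnv:
    """Toy environment: the episode ends when the agent calls the `finish` tool."""

    def reset(self, task):
        self.task = task
        return TimeStep(observation=task)

    def step(self, action):
        # An action is a structured tool call: {"tool": str, "args": dict}.
        if action["tool"] == "finish":
            return TimeStep(observation="done", reward=1.0, terminated=True)
        return TimeStep(observation=f"ran {action['tool']}")

    def close(self):
        pass


class MaxStepsWrapper:
    """Wrapper that truncates the episode after `max_steps` steps."""

    def __init__(self, env, max_steps):
        self.env, self.max_steps, self.steps = env, max_steps, 0

    def reset(self, task):
        self.steps = 0
        return self.env.reset(task)

    def step(self, action):
        ts = self.env.step(action)
        self.steps += 1
        if self.steps >= self.max_steps and not ts.terminated:
            ts.truncated = True
        return ts

    def close(self):
        self.env.close()
```

Because the wrapper exposes the same three methods as the environment it wraps, wrappers of this shape can be stacked in any order.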

DomainProfile — Domain-Agnostic Abstraction

Declarative domain configuration covering toolsets, tool categories, outcome rules, and scoring dimension weights. Seven domains are built in (coding, browser, conversation, engineering, advisory, discussion, generic), and new domains can be added without modifying core code.
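
One way to picture a declarative domain configuration is as plain data plus a registry, so that adding a domain means registering a new object rather than editing scoring code. The sketch below is an assumption about the general shape only: the class `DomainProfileSketch`, its field names, and the `register` helper are hypothetical and do not mirror the real `DomainProfile` class.

```python
from dataclasses import dataclass


@dataclass
class DomainProfileSketch:
    name: str
    tool_categories: dict    # tool name -> category label
    outcome_rule: str        # free-form rule description in this sketch
    dimension_weights: dict  # scoring dimension -> relative weight

    def weighted_score(self, scores):
        """Combine per-dimension scores using the normalized profile weights."""
        total = sum(self.dimension_weights.values())
        return sum(
            weight * scores.get(dim, 0.0)
            for dim, weight in self.dimension_weights.items()
        ) / total


PROFILES = {}


def register(profile):
    """Add a profile to the registry; core scoring code stays untouched."""
    PROFILES[profile.name] = profile
    return profile


register(DomainProfileSketch(
    name="coding",
    tool_categories={"file_read": "read", "file_write": "write", "shell": "execute"},
    outcome_rule="tests_pass",
    dimension_weights={"correctness": 3.0, "efficiency": 1.0},
))
```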

Three-Layer Process Reward Model

Each step receives a process reward $r_t = R(s_t, a_t)$ rather than the episode receiving a single sparse outcome reward. Three layers progressively improve evaluation quality:

Layer	Method	Cost	Latency
Rule-based	Redundancy, regression, info utilization, efficiency	~0	<1ms
LLM-as-Judge	Rubric-based multi-dimensional semantic scoring	~$0.01/step	~1s
Human	Calibration via human annotations	Offline	Offline
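
To illustrate why the rule-based layer in the table above can be nearly free, here is a minimal sketch of one rule (redundancy) and of blending rule rewards with judge scores. The step format, the penalty value, and the `blend` weighting are assumptions made for this example; the actual rules and aggregation in knowlyr-reward may differ.

```python
import json


def redundancy_rewards(steps, penalty=-0.5):
    """Give `penalty` to any step that repeats an earlier identical tool call."""
    seen = set()
    rewards = []
    for step in steps:
        # Canonicalize the arguments so that key order does not matter.
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        rewards.append(penalty if key in seen else 0.0)
        seen.add(key)
    return rewards


def blend(rule_rewards, judge_rewards, judge_weight=0.7):
    """Per-step weighted average of rule-based and LLM-judge rewards."""
    return [
        (1 - judge_weight) * r + judge_weight * j
        for r, j in zip(rule_rewards, judge_rewards)
    ]


steps = [
    {"tool": "shell", "args": {"cmd": "ls"}},
    {"tool": "file_read", "args": {"path": "a.txt"}},
    {"tool": "shell", "args": {"cmd": "ls"}},  # repeats step 0
]
rule = redundancy_rewards(steps)
```

The rule pass needs only set lookups, which is consistent with the sub-millisecond latency quoted for that layer, while judge scores would come from a separate and slower model call.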

Policy Optimization — SFT / DPO / GRPO

The three methods span the spectrum from behavioral cloning to online policy optimization. They are complemented by six agent-specific training enhancements: observation masking, step-weighted loss, trajectory chunking, curriculum learning, multi-turn formatting, and step-level GRPO.
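
The core idea behind GRPO is that each rollout's advantage is computed relative to the other rollouts sampled for the same task, so no learned value function is needed. The sketch below shows that normalization in isolation; the function name is hypothetical, and the use of the population standard deviation is one common choice rather than a statement about knowlyr-trainer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task, scored by the reward model.
advantages = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
```

Rollouts that beat the group mean get a positive advantage and the rest a negative one; a group with identical rewards yields all zeros and therefore no gradient signal.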

Quick Start

from knowlyrcore import make

env = make("knowlyr/conversation")      # look up a registered environment by ID
ts = env.reset(task="Help user check order status")
while not ts.done:
    action = my_agent(ts.observation)   # your policy π(a|s); my_agent is a placeholder
    ts = env.step(action)               # next observation s', reward r, done flag
env.close()

pip install knowlyr-hub[all]

Components

Package	RL Role	Description
knowlyr-core	MDP Protocol	`AgentEnv` · `TimeStep` · `EnvWrapper` · `Registry` · `DomainProfile`
knowlyr-sandbox	Environment	Docker sandbox execution · `SandboxEnv` · `ConversationEnv`
knowlyr-recorder	Trajectory Buffer	Agent log parsing · standardized trajectories · adapter registry
knowlyr-reward	Reward Model	Three-layer PRM · Rubric scoring · preference pair construction
knowlyr-hub	Rollout & Data	`collect()` sampling · `DatasetExporter` · Pipeline orchestration · CAS dedup · GDI ranking
knowlyr-trainer	Policy Optimization	SFT · DPO · GRPO · evaluation · inference bridge
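
The preference pair construction listed for knowlyr-reward feeds DPO training, which needs (chosen, rejected) pairs. A minimal sketch of that step, assuming each scored trajectory is a dict with hypothetical `id` and `reward` keys and that all trajectories in the input belong to the same task:

```python
from itertools import combinations


def build_preference_pairs(scored, min_margin=0.1):
    """Pair trajectories whose rewards differ by at least `min_margin`."""
    pairs = []
    for a, b in combinations(scored, 2):
        if abs(a["reward"] - b["reward"]) < min_margin:
            continue  # too close to call; skip ambiguous pairs
        chosen, rejected = (a, b) if a["reward"] > b["reward"] else (b, a)
        pairs.append({"chosen": chosen["id"], "rejected": rejected["id"]})
    return pairs


scored = [
    {"id": "t1", "reward": 0.9},
    {"id": "t2", "reward": 0.4},
    {"id": "t3", "reward": 0.85},
]
pairs = build_preference_pairs(scored)
```

The margin filter is a design choice in this sketch: it drops pairs whose reward gap is small enough that the ordering could be noise.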

Ecosystem

Layer	Project	Description
Discovery	AI Dataset Radar	Dataset competitive intelligence and trend analysis
Analysis	DataRecipe	Reverse engineering, schema extraction, cost estimation
Production	DataSynth / DataLabel	LLM batch synthesis / lightweight annotation
Quality	DataCheck	Rule validation, dedup detection, distribution analysis
Audit	ModelAudit	Distillation detection, model fingerprinting
Deliberation	Crew	Adversarial multi-agent deliberation · persistent memory evolution
Identity	knowlyr-id	Identity system + AI employee runtime
Agent Training	knowlyr-gym	Gymnasium-style RL framework for LLM agent training