Knowlyr Gym


RL Training Framework

★ 3 ⑂ 0 Updated 2026-03-15
Gymnasium-style RL training framework — MDP formalization for modeling Agent decision processes, Process Reward Models (PRM) providing step-level feedback signals, with sandbox isolation, trajectory recording, and distributed pipeline orchestration.
MDP Modeling · Process Reward Model · Sandbox Isolation

Quick Start

Install

```shell
pip install "knowlyr-hub[all]"
```

Usage

```python
from trajectoryhub import Pipeline, PipelineConfig

pipeline = Pipeline(PipelineConfig(output_dir="./output"))
traj = pipeline.run_from_log("agent.jsonl", "openhands")
```
Tools

| Tool | Description |
| --- | --- |
| `create_sandbox` | Create a Docker sandbox execution environment |
| `execute_tool` | Execute a tool in the sandbox (file_read, file_write, shell, search, git) |
| `reset_sandbox` | Reset the sandbox to its initial state |
| `replay_trajectory` | Replay an Agent execution trajectory in the sandbox |
| `sandbox_snapshot` | Save a snapshot of the current sandbox state (filesystem diff + environment info) |
| `convert_logs` | Convert Agent logs to the standardized trajectory format |
| `validate_logs` | Validate that log files match a specified Agent framework format |
| `get_schema` | Return the JSON Schema definition for standardized trajectories |
| `recorder_diff` | Compare two trajectories: step count, tool usage, success rate, etc. |
| `score_trajectory` | Compute a process-level reward for a single Agent trajectory |
| `build_preferences` | Build preference pairs from multiple trajectories (for RLHF/DPO training) |
| `calibrate_reward` | Calibrate automatic rewards against human annotations |
| `list_rubrics` | List available evaluation rubric dimensions |
| `reward_leaderboard` | Generate a reward leaderboard from multiple trajectories: rank by reward score, compare models/strategies |
| `run_pipeline` | Run the complete Agent trajectory data pipeline (Task -> Sandbox -> Recorder -> Reward -> Export) |
| `export_dataset` | Export trajectory data in a specified training format (SFT / DPO / Benchmark / HuggingFace) |
| `process_log` | Process a single Agent log file: parse and score it into a standardized trajectory |
| `process_logs_batch` | Batch-process an Agent log directory: parse and score into standardized trajectories |
| `pipeline_status` | View pipeline execution status and progress |

Documentation

knowlyr-gym

Gymnasium-Style Reinforcement Learning Framework
for LLM Agent Training

MDP Formalization · Three-Layer Process Reward Model · SFT / DPO / GRPO Policy Optimization

Formalized MDP environments, three-layer process reward, and complete policy optimization pipeline

Quick Start · Architecture · Key Innovations · Components · Ecosystem

What is knowlyr-gym?

knowlyr-gym is a training infrastructure for LLM Agents — not another inference framework. It answers three fundamental questions: where to train (Gymnasium-compatible environments), how to evaluate (three-layer Process Reward Model), and how to optimize (SFT / DPO / GRPO policy training). Environments produce trajectories, rewards assess quality, and trainers optimize policy — all connected through standardized data formats into a closed loop.

The framework formalizes LLM tool-use agent tasks as Markov Decision Processes $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ and implements the complete reinforcement learning pipeline from environment interaction to policy optimization.
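Under that MDP formalization, each trajectory accumulates a discounted return in the usual way. As a minimal illustration (plain Python, not the framework's API), the return for one trajectory's reward sequence can be computed with a backward pass:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = sum_t gamma^t * r_t for one trajectory of rewards.

    Iterating backwards turns the sum into the recurrence
    G_t = r_t + gamma * G_{t+1}, avoiding explicit powers of gamma.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With step-level process rewards (below), `rewards` holds one PRM score per step rather than a single sparse outcome reward at the end.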

Architecture

```mermaid
graph LR
    subgraph MDP["MDP Environment Layer"]
        ENV["AgentEnv<br/>reset() / step() / close()"]
        TS["TimeStep<br/>observation · reward<br/>terminated · truncated"]
        ENV --> TS
    end

    subgraph RL["RL Training Loop"]
        PI["Policy π<br/>(LLM Agent)"]
        COL["Rollout<br/>collect()"]
        RM["Process Reward<br/>Model (PRM)"]
        EXP["Dataset<br/>SFT / DPO / GRPO"]
        OPT["Policy<br/>Optimization"]
    end

    PI -->|action| ENV
    TS -->|observation| PI
    COL -->|trajectories| RM
    RM -->|scored trajectories| EXP
    EXP --> OPT
    OPT -->|updated π| PI
    ENV -.->|wrappers| COL

    style MDP fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style RL fill:#0d1b2a,color:#e0e0e0,stroke:#444
    style PI fill:#0969da,color:#fff,stroke:#0969da
    style RM fill:#8b5cf6,color:#fff,stroke:#8b5cf6
```

Key Innovations

Gymnasium-Compatible Environment Protocol

5 registered environments (knowlyr/sandbox, knowlyr/conversation, knowlyr/engineering, knowlyr/advisory, knowlyr/discussion) with 4 composable wrappers — extending the Gymnasium reset() / step() / close() pattern to LLM Agent scenarios with structured tool-call actions and natural language state spaces.
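To make the protocol concrete, here is a toy environment written against the reset() / step() / close() pattern described above. This is a schematic reimplementation for illustration only: the `TimeStep` fields mirror the ones named in the architecture diagram, but the class names and signatures are assumptions, not the library's actual interface.

```python
from dataclasses import dataclass

@dataclass
class TimeStep:
    # Field names mirror the protocol described above (assumed, not the real API).
    observation: str
    reward: float
    terminated: bool   # task reached a terminal state
    truncated: bool    # step budget exhausted

class EchoEnv:
    """Toy text environment following the reset()/step()/close() pattern."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self._t = 0

    def reset(self, task=""):
        self._t = 0
        return TimeStep(observation=f"task: {task}", reward=0.0,
                        terminated=False, truncated=False)

    def step(self, action):
        self._t += 1
        done = action == "submit"  # terminal action for this toy env
        return TimeStep(observation=f"echo: {action}",
                        reward=1.0 if done else 0.0,
                        terminated=done,
                        truncated=self._t >= self.max_steps)

    def close(self):
        pass  # a real environment would release its sandbox here
```

Separating `terminated` from `truncated` follows the Gymnasium convention: only the former means the task itself ended, which matters when bootstrapping value estimates.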

DomainProfile — Domain-Agnostic Abstraction

Declarative domain configuration covering toolsets, tool categories, outcome rules, and scoring dimension weights. 7 built-in domains (coding, browser, conversation, engineering, advisory, discussion, generic) — add new domains without modifying core code.
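A declarative profile of this kind might look like the sketch below. The field names, the example rules, and the weights are all illustrative assumptions chosen to match the description above, not the framework's actual `DomainProfile` schema.

```python
from dataclasses import dataclass

@dataclass
class DomainProfile:
    """Schematic declarative domain profile (field names are assumptions)."""
    name: str
    toolset: list           # tools available in this domain
    tool_categories: dict   # tool name -> category label
    outcome_rules: dict     # rule name -> predicate over the final state
    scoring_weights: dict   # scoring dimension -> weight

# Hypothetical "coding" domain: configuration only, no core-code changes.
coding = DomainProfile(
    name="coding",
    toolset=["file_read", "file_write", "shell", "git"],
    tool_categories={"file_read": "inspect", "file_write": "edit",
                     "shell": "execute", "git": "vcs"},
    outcome_rules={"tests_pass": lambda state: state.get("exit_code") == 0},
    scoring_weights={"efficiency": 0.3, "redundancy": 0.3, "info_use": 0.4},
)
```

Because a domain is pure data plus predicates, adding one means writing a profile like this rather than touching environment or reward code.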

Three-Layer Process Reward Model

Step-level process reward $r_t = R(s_t, a_t)$ instead of sparse outcome reward. Three layers progressively improve evaluation quality:

| Layer | Method | Cost | Latency |
| --- | --- | --- | --- |
| Rule-based | Redundancy, regression, info utilization, efficiency | ~0 | <1 ms |
| LLM-as-Judge | Rubric-based multi-dimensional semantic scoring | ~$0.01/step | ~1 s |
| Human | Calibration via human annotations | Offline | Offline |
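The cheap rule-based layer can be sketched as a weighted combination of per-dimension heuristic scores. The specific heuristics below are illustrative stand-ins for the dimensions named in the table, not the framework's actual rules:

```python
def rule_based_step_reward(step, weights):
    """Combine per-dimension rule scores in [0, 1] into one step reward r_t.

    `step` is an assumed dict with the fields each heuristic needs;
    `weights` maps dimension name -> weight (e.g. from a DomainProfile).
    """
    scores = {
        # Penalize repeating an identical earlier action.
        "redundancy": 0.0 if step["action"] in step["history"] else 1.0,
        # Reward actions that actually use the latest observation.
        "info_use": 1.0 if step["uses_observation"] else 0.0,
        # Later steps in long trajectories score lower on efficiency.
        "efficiency": 1.0 / (1 + len(step["history"])),
    }
    return sum(weights[d] * scores[d] for d in weights)
```

Because every score is a pure function of the trajectory, this layer costs nothing per step; the LLM-as-Judge layer then refines only the dimensions rules cannot capture.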

Policy Optimization — SFT / DPO / GRPO

Three methods spanning the full spectrum from behavioral cloning to online policy optimization, plus 6 agent-specific training enhancements: observation masking, step-weighted loss, trajectory chunking, curriculum learning, multi-turn formatting, and step-level GRPO.
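Two of those enhancements, observation masking and step-weighted loss, compose naturally in the loss function. The sketch below shows the idea in plain Python; the function and field names are assumptions for illustration, not the trainer's real interface:

```python
def step_weighted_nll(token_nll, token_step_ids, step_rewards, is_observation):
    """Observation-masked, step-weighted NLL over one trajectory.

    token_nll:      per-token negative log-likelihood
    token_step_ids: index of the step each token belongs to
    step_rewards:   per-step weights (e.g. PRM scores)
    is_observation: True for environment-produced tokens
    """
    total, norm = 0.0, 0.0
    for nll, sid, obs in zip(token_nll, token_step_ids, is_observation):
        if obs:
            continue               # observation masking: no loss on env tokens
        w = step_rewards[sid]      # step-weighted loss: scale by step reward
        total += w * nll
        norm += w
    return total / max(norm, 1e-8)
```

Masking keeps the policy from learning to imitate the environment's outputs, while per-step weights let high-reward steps dominate the gradient, which is the same intuition step-level GRPO applies at the advantage level.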

Quick Start

from knowlyrcore import make

env = make("knowlyr/conversation")
ts = env.reset(task="Help user check order status")
while not ts.done:
    action = my_agent(ts.observation)   # π(a|s)
    ts = env.step(action)              # s', r, done
env.close()
pip install knowlyr-hub[all]

Components

| Package | RL Role | Description |
| --- | --- | --- |
| `knowlyr-core` | MDP Protocol | AgentEnv · TimeStep · EnvWrapper · Registry · DomainProfile |
| `knowlyr-sandbox` | Environment | Docker sandbox execution · SandboxEnv · ConversationEnv |
| `knowlyr-recorder` | Trajectory Buffer | Agent log parsing · standardized trajectories · adapter registry |
| `knowlyr-reward` | Reward Model | Three-layer PRM · rubric scoring · preference pair construction |
| `knowlyr-hub` | Rollout & Data | collect() sampling · DatasetExporter · pipeline orchestration · CAS dedup · GDI ranking |
| `knowlyr-trainer` | Policy Optimization | SFT · DPO · GRPO · evaluation · inference bridge |

Ecosystem

| Layer | Project | Description |
| --- | --- | --- |
| Discovery | AI Dataset Radar | Dataset competitive intelligence and trend analysis |
| Analysis | DataRecipe | Reverse engineering, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, dedup detection, distribution analysis |
| Audit | ModelAudit | Distillation detection, model fingerprinting |
| Deliberation | Crew | Adversarial multi-agent deliberation · persistent memory evolution |
| Identity | knowlyr-id | Identity system + AI employee runtime |
| Agent Training | knowlyr-gym | Gymnasium-style RL framework for LLM agent training |

Want to discuss this project? Reach out to:

- Kai, Founder & CEO
- 赵云帆 (Zhao Yunfan), AI Backend Engineer