Quick Start

```bash
pip install "knowlyr-hub[all]"
```

```python
from trajectoryhub import Pipeline, PipelineConfig

pipeline = Pipeline(PipelineConfig(output_dir="./output"))
traj = pipeline.run_from_log("agent.jsonl", "openhands")
```
| Tool | Description |
|---|---|
| `create_sandbox` | Create a Docker sandbox execution environment |
| `execute_tool` | Execute a tool in the sandbox (`file_read`, `file_write`, `shell`, `search`, `git`) |
| `reset_sandbox` | Reset the sandbox to its initial state |
| `replay_trajectory` | Replay an Agent execution trajectory in the sandbox |
| `sandbox_snapshot` | Save a snapshot of the sandbox's current state (filesystem diff + environment info) |
| `convert_logs` | Convert Agent logs to the standardized trajectory format |
| `validate_logs` | Validate whether log files match a specified Agent framework format |
| `get_schema` | Return the JSON Schema definition for standardized trajectories |
| `recorder_diff` | Compare two trajectories: step count, tool usage, success rate, etc. |
| `score_trajectory` | Compute a process-level reward for a single Agent trajectory |
| `build_preferences` | Build preference pairs from multiple trajectories (for RLHF/DPO training) |
| `calibrate_reward` | Calibrate automatic rewards against human annotations |
| `list_rubrics` | List available evaluation rubric dimensions |
| `reward_leaderboard` | Generate a reward leaderboard from multiple trajectories, ranked by reward score, to compare models/strategies |
| `run_pipeline` | Run the complete Agent trajectory data pipeline (Task -> Sandbox -> Recorder -> Reward -> Export) |
| `export_dataset` | Export trajectory data in a specified training format (SFT / DPO / Benchmark / HuggingFace) |
| `process_log` | Parse and score a single Agent log file into a standardized trajectory |
| `process_logs_batch` | Batch-process a directory of Agent logs into scored, standardized trajectories |
| `pipeline_status` | View pipeline execution status and progress |
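The conversion, scoring, and export tools above compose into one log-to-dataset flow. A minimal sketch of that flow, with hypothetical plain functions and dict shapes standing in for the actual tool handlers:

```python
# Hypothetical sketch of the convert_logs -> score_trajectory -> export_dataset
# flow; function names and record shapes are illustrative, not the real handlers.
def convert_logs(raw_lines):
    """Parse raw agent log lines into standardized steps."""
    return [{"step": i, "action": line.strip()} for i, line in enumerate(raw_lines)]

def score_trajectory(steps):
    """Attach a toy process-level reward to each step."""
    for s in steps:
        s["reward"] = 1.0 if s["action"] else 0.0
    return steps

def export_dataset(steps, fmt="sft"):
    """Flatten scored steps into a single training record."""
    return {"format": fmt,
            "actions": [s["action"] for s in steps],
            "rewards": [s["reward"] for s in steps]}
```

In the real pipeline, `run_pipeline` orchestrates these stages end to end; the sketch only shows how their inputs and outputs chain.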
knowlyr-gym
Gymnasium-Style Reinforcement Learning Framework for LLM Agent Training
MDP Formalization · Three-Layer Process Reward Model · SFT / DPO / GRPO Policy Optimization
Formalized MDP environments, three-layer process reward, and complete policy optimization pipeline
What is knowlyr-gym?
knowlyr-gym is a training infrastructure for LLM Agents — not another inference framework. It answers three fundamental questions: where to train (Gymnasium-compatible environments), how to evaluate (three-layer Process Reward Model), and how to optimize (SFT / DPO / GRPO policy training). Environments produce trajectories, rewards assess quality, and trainers optimize policy — all connected through standardized data formats into a closed loop.
The framework formalizes LLM tool-use agent tasks as Markov Decision Processes $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ and implements the complete reinforcement learning pipeline from environment interaction to policy optimization.
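Under this formalization, training maximizes the expected discounted return. A standard statement of the objective, written with the same symbols as the tuple above:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right] $$

Here $\tau = (s_0, a_0, s_1, \ldots)$ is a trajectory sampled by rolling the policy $\pi$ in the environment, and $R(s_t, a_t)$ is the step-level process reward described below.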
Architecture
```mermaid
graph LR
    subgraph MDP["MDP Environment Layer"]
        ENV["AgentEnv<br/>reset() / step() / close()"]
        TS["TimeStep<br/>observation · reward<br/>terminated · truncated"]
        ENV --> TS
    end
    subgraph RL["RL Training Loop"]
        PI["Policy π<br/>(LLM Agent)"]
        COL["Rollout<br/>collect()"]
        RM["Process Reward<br/>Model (PRM)"]
        EXP["Dataset<br/>SFT / DPO / GRPO"]
        OPT["Policy<br/>Optimization"]
    end
    PI -->|action| ENV
    TS -->|observation| PI
    COL -->|trajectories| RM
    RM -->|scored trajectories| EXP
    EXP --> OPT
    OPT -->|updated π| PI
    ENV -.->|wrappers| COL
    style MDP fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style RL fill:#0d1b2a,color:#e0e0e0,stroke:#444
    style PI fill:#0969da,color:#fff,stroke:#0969da
    style RM fill:#8b5cf6,color:#fff,stroke:#8b5cf6
```
Key Innovations
Gymnasium-Compatible Environment Protocol
5 registered environments (knowlyr/sandbox, knowlyr/conversation, knowlyr/engineering, knowlyr/advisory, knowlyr/discussion) with 4 composable wrappers — extending the Gymnasium reset() / step() / close() pattern to LLM Agent scenarios with structured tool-call actions and natural language state spaces.
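To make the protocol concrete, here is a minimal toy sketch of the `reset()` / `step()` / `close()` pattern with a composable wrapper. The `TimeStep` fields follow the architecture diagram (observation, reward, terminated, truncated); the environment and wrapper classes are simplified stand-ins, not the actual knowlyr-core implementations.

```python
from dataclasses import dataclass

@dataclass
class TimeStep:
    observation: str
    reward: float = 0.0
    terminated: bool = False
    truncated: bool = False

class EchoEnv:
    """Toy AgentEnv: terminates when the agent says 'done'."""
    def reset(self, task: str = "") -> TimeStep:
        self._steps = 0
        return TimeStep(observation=task)

    def step(self, action: str) -> TimeStep:
        self._steps += 1
        done = action == "done"
        return TimeStep(observation=f"echo: {action}",
                        reward=1.0 if done else 0.0,
                        terminated=done)

    def close(self) -> None:
        pass

class StepLimitWrapper:
    """Composable wrapper: truncates after max_steps, like Gymnasium's TimeLimit."""
    def __init__(self, env, max_steps: int):
        self.env, self.max_steps = env, max_steps

    def reset(self, **kwargs) -> TimeStep:
        self._t = 0
        return self.env.reset(**kwargs)

    def step(self, action: str) -> TimeStep:
        self._t += 1
        ts = self.env.step(action)
        if self._t >= self.max_steps and not ts.terminated:
            ts.truncated = True  # out of budget, but task not finished
        return ts

    def close(self) -> None:
        self.env.close()
```

Keeping wrappers to the same three-method surface is what makes them composable: any wrapper can wrap any environment, including another wrapper.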
DomainProfile — Domain-Agnostic Abstraction
Declarative domain configuration covering toolsets, tool categories, outcome rules, and scoring dimension weights. 7 built-in domains (coding, browser, conversation, engineering, advisory, discussion, generic) — add new domains without modifying core code.
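A hypothetical sketch of what such a declarative profile could look like for the `coding` domain; the field names (`tools`, `tool_categories`, `outcome_rules`, `dimension_weights`) are illustrative stand-ins, not the actual knowlyr-core `DomainProfile` API.

```python
from dataclasses import dataclass

@dataclass
class DomainProfile:
    """Declarative domain config: no core-code changes needed to add a domain."""
    name: str
    tools: list              # toolset exposed to the agent
    tool_categories: dict    # tool -> category, for per-category rules
    outcome_rules: dict      # rule name -> success criterion (description)
    dimension_weights: dict  # scoring dimension -> weight

coding = DomainProfile(
    name="coding",
    tools=["file_read", "file_write", "shell", "search", "git"],
    tool_categories={"shell": "execution", "git": "vcs"},
    outcome_rules={"tests_pass": "test runner exits with code 0"},
    dimension_weights={"correctness": 0.5, "efficiency": 0.3, "info_use": 0.2},
)
```

Registering a new domain then amounts to declaring another such profile rather than editing environment or reward code.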
Three-Layer Process Reward Model
Step-level process reward $r_t = R(s_t, a_t)$ instead of sparse outcome reward. Three layers progressively improve evaluation quality:
| Layer | Method | Cost | Latency |
|---|---|---|---|
| Rule-based | Redundancy, regression, info utilization, efficiency | ~0 | <1ms |
| LLM-as-Judge | Rubric-based multi-dimensional semantic scoring | ~$0.01/step | ~1s |
| Human | Calibration via human annotations | Offline | Offline |
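The cheap first layer can be sketched as a handful of deterministic checks over the step and its history. The heuristics and penalty weights below are illustrative only, not knowlyr-reward's actual rules; they show why this layer costs ~0 and runs in under a millisecond.

```python
def rule_based_step_reward(history: list, step: dict) -> float:
    """Score one (state, action) step in [0, 1] from cheap deterministic rules.

    history: earlier steps as dicts with "tool" / "args" keys.
    step:    the current step; may carry an "error" field from the sandbox.
    """
    score = 1.0
    # Redundancy: penalize repeating an identical tool call.
    if any(prev["tool"] == step["tool"] and prev["args"] == step["args"]
           for prev in history):
        score -= 0.5
    # Efficiency: mildly penalize steps deep into a long trajectory.
    if len(history) >= 10:
        score -= 0.2
    # Regression: penalize a step whose tool call errored.
    if step.get("error"):
        score -= 0.3
    return max(score, 0.0)
```

Steps that pass the rule layer cleanly can then be escalated to the LLM-as-Judge layer only when needed, which keeps the average per-step cost low.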
Policy Optimization — SFT / DPO / GRPO
Three methods spanning the full spectrum from behavioral cloning to online policy optimization, plus 6 agent-specific training enhancements: observation masking, step-weighted loss, trajectory chunking, curriculum learning, multi-turn formatting, and step-level GRPO.
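Observation masking, the first enhancement, can be sketched in a few lines: when fine-tuning on multi-turn agent trajectories, only the agent's own action tokens contribute to the loss, while environment observations are masked with the ignore index used by common cross-entropy implementations. The token ids and role names below are toy values for illustration.

```python
IGNORE_INDEX = -100  # ignored by common cross-entropy loss implementations

def mask_observations(turns):
    """turns: list of (role, token_ids) pairs -> (input_ids, labels).

    The model still *sees* every token, but gradient flows only
    through the assistant's action tokens.
    """
    input_ids, labels = [], []
    for role, toks in turns:
        input_ids.extend(toks)
        if role == "assistant":          # train on agent actions
            labels.extend(toks)
        else:                            # mask env observations / user turns
            labels.extend([IGNORE_INDEX] * len(toks))
    return input_ids, labels
```

Without this mask, the policy would waste capacity learning to reproduce tool outputs it never generates at inference time.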
Quick Start

```bash
pip install "knowlyr-hub[all]"
```

```python
from knowlyrcore import make

env = make("knowlyr/conversation")
ts = env.reset(task="Help user check order status")
while not ts.done:
    action = my_agent(ts.observation)  # π(a|s)
    ts = env.step(action)              # s', r, done
env.close()
```
Components
| Package | RL Role | Description |
|---|---|---|
| knowlyr-core | MDP Protocol | AgentEnv · TimeStep · EnvWrapper · Registry · DomainProfile |
| knowlyr-sandbox | Environment | Docker sandbox execution · SandboxEnv · ConversationEnv |
| knowlyr-recorder | Trajectory Buffer | Agent log parsing · standardized trajectories · adapter registry |
| knowlyr-reward | Reward Model | Three-layer PRM · Rubric scoring · preference pair construction |
| knowlyr-hub | Rollout & Data | collect() sampling · DatasetExporter · Pipeline orchestration · CAS dedup · GDI ranking |
| knowlyr-trainer | Policy Optimization | SFT · DPO · GRPO · evaluation · inference bridge |
Ecosystem
| Layer | Project | Description |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset competitive intelligence and trend analysis |
| Analysis | DataRecipe | Reverse engineering, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, dedup detection, distribution analysis |
| Audit | ModelAudit | Distillation detection, model fingerprinting |
| Deliberation | Crew | Adversarial multi-agent deliberation · persistent memory evolution |
| Identity | knowlyr-id | Identity system + AI employee runtime |
| Agent Training | knowlyr-gym | Gymnasium-style RL framework for LLM agent training |
Want to discuss this project? Reach out to