Python Analysis
Analyzing reproducible experiments often requires locating specific output files buried within hashed directory structures. repx-py abstracts this complexity, allowing users to query jobs by name, parameters, or dependency relationships and retrieve their outputs as standard Python objects or pandas DataFrames.
Installation
repx-py is available as a Nix flake package. Add it to your project's development shell:
# flake.nix
{
  inputs.repx.url = "github:repx-org/repx";

  outputs = { self, nixpkgs, repx }: {
    devShells.x86_64-linux.default = nixpkgs.legacyPackages.x86_64-linux.mkShell {
      packages = [
        repx.packages.x86_64-linux.repx
      ];
    };
  };
}
Or build it directly:
nix build github:repx-org/repx#repx
Loading an Experiment
The Experiment class is the entry point. It loads the Lab metadata and allows you to query runs and jobs.
from repx_py import Experiment
# Initialize from the built lab directory
exp = Experiment(lab_path="./result")
print(f"Loaded experiment with {len(exp.jobs())} total jobs")
Querying Jobs
The JobCollection interface allows you to filter jobs based on their metadata and parameters.
1. Basic Filtering
# Get all jobs whose name starts with 'simulation' (here, the jobs of the 'simulation' run)
jobs = exp.jobs().filter(name__startswith="simulation")
# Filter by exact parameter match
jobs = exp.jobs().filter(param_model="resnet50")
2. Advanced Filtering
Supported operators:
- __startswith
- __endswith
- __contains
- param_<NAME>: Search within effective parameters.
# Find jobs where learning rate is 0.01 AND model name contains 'net'
target_jobs = exp.jobs().filter(
    param_learning_rate=0.01,
    param_model__contains="net",
)
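To make the lookup semantics concrete, here is a minimal sketch of how suffix-based filters of this kind can be evaluated. It is not repx-py's implementation: the dictionary layout of a job (a name plus a params mapping), the helper names matches and _lookup, and the choice to compare via string conversion are all assumptions made for illustration.

```python
from typing import Any, Mapping

# Hypothetical stand-in for a job record; repx-py's real objects may differ.
Job = Mapping[str, Any]

_MISSING = object()

# Comparison functions keyed by lookup suffix; no suffix means exact match.
_OPS = {
    "exact": lambda actual, expected: actual == expected,
    "startswith": lambda actual, expected: str(actual).startswith(str(expected)),
    "endswith": lambda actual, expected: str(actual).endswith(str(expected)),
    "contains": lambda actual, expected: str(expected) in str(actual),
}


def _lookup(job: Job, field: str) -> Any:
    """Read a metadata field, or a parameter when the field is param_<NAME>."""
    if field.startswith("param_"):
        return job.get("params", {}).get(field[len("param_"):], _MISSING)
    return job.get(field, _MISSING)


def matches(job: Job, **filters: Any) -> bool:
    """Return True only if every filter holds (filters are combined with AND)."""
    for key, expected in filters.items():
        field, sep, op = key.rpartition("__")
        if not sep or op not in _OPS:
            field, op = key, "exact"  # no recognised suffix: exact match
        actual = _lookup(job, field)
        if actual is _MISSING or not _OPS[op](actual, expected):
            return False
    return True


# Illustrative records only; names and values are made up for this sketch.
jobs = [
    {"name": "simulation.train-model-batch-1",
     "params": {"learning_rate": 0.01, "model": "resnet"}},
    {"name": "analysis.plot", "params": {"model": "vgg"}},
]
selected = [
    j["name"] for j in jobs
    if matches(j, param_learning_rate=0.01, param_model__contains="net")
]
```

In this sketch a job that lacks the requested parameter never matches, which mirrors the AND behaviour described in the comment above; whether repx-py treats missing parameters the same way is not confirmed here.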
3. Converting to DataFrame
You can convert a collection of jobs into a pandas DataFrame to inspect their metadata and parameters in tabular form.
df = target_jobs.to_dataframe()
print(df)
# Output:
#                                    name  param_learning_rate param_model
# job_id
# 8f2a...  simulation.train-model-batch-1                 0.01      resnet
Accessing Results
Once you have a JobView (by iterating over a collection or looking up a specific job ID), you can access its outputs.
for job in target_jobs:
    print(f"Analyzing Job: {job.id}")

    # 1. Get absolute path to an output file
    log_path = job.get_output_path("run.log")

    # 2. Load CSV data directly
    # (Assumes the stage defined an output named "metrics.csv")
    metrics_df = job.load_csv("metrics.csv")
    print(metrics_df.describe())

    # 3. Load JSON data
    config = job.load_json("config.json")
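When comparing several jobs, it can help to gather one output file from each job into a single table. The sketch below relies only on job.id and job.get_output_path as shown above and reads the files with the standard csv module; the helper name collect_rows and the added job_id column are hypothetical, and the function assumes every job actually produced the named output.

```python
import csv
from typing import Any, Dict, Iterable, List


def collect_rows(jobs: Iterable[Any], output_name: str = "metrics.csv") -> List[Dict[str, str]]:
    """Read the same CSV output from each job and tag every row with its job ID.

    Each job is assumed to expose `.id` and `.get_output_path(name)`.
    Values are returned as strings, exactly as csv.DictReader yields them.
    """
    rows: List[Dict[str, str]] = []
    for job in jobs:
        path = job.get_output_path(output_name)
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                # The job ID is written last so it wins over a same-named CSV column.
                rows.append({**row, "job_id": job.id})
    return rows
```

Calling collect_rows(target_jobs) would yield a list of dictionaries that can be passed to pandas.DataFrame if a tabular view is wanted.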
Effective Parameters
RepX resolves the "effective parameters" for every job by tracing values inherited from upstream dependencies. This means you always know exactly what configuration produced a result, even if parameters were defined in a producer stage.
# Access resolved parameters
print(job.effective_params)
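The idea of inheriting parameters through the dependency graph can be illustrated with a small, self-contained sketch. This is not how RepX itself resolves parameters: the JOBS table, its params and deps keys, and the rule that a job's own values override inherited ones are all assumptions chosen for the example.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical job table: each job has its own parameters and a list of upstream job IDs.
JOBS: Dict[str, Dict[str, Any]] = {
    "producer": {"params": {"model": "resnet50", "seed": 1}, "deps": []},
    "trainer": {"params": {"learning_rate": 0.01}, "deps": ["producer"]},
    "analysis": {"params": {"seed": 7}, "deps": ["trainer"]},
}


def effective_params(job_id: str, jobs: Mapping[str, Mapping[str, Any]]) -> Dict[str, Any]:
    """Merge parameters inherited from upstream jobs with the job's own parameters."""
    resolved: Dict[str, Any] = {}
    deps: List[str] = jobs[job_id]["deps"]
    for dep_id in deps:
        # Recurse so values defined several stages upstream are carried forward.
        resolved.update(effective_params(dep_id, jobs))
    # Assumption for this sketch: a job's own values take precedence over inherited ones.
    resolved.update(jobs[job_id]["params"])
    return resolved


resolved = effective_params("analysis", JOBS)
# resolved == {"model": "resnet50", "seed": 7, "learning_rate": 0.01}
```

Here the analysis job sees the model chosen by the producer stage two steps upstream, which is the property the effective parameters are meant to provide.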
Analysis within a Pipeline
When running an analysis stage inside a RepX pipeline, you don't have access to the full Lab directory (because it's being built!). Instead, you use the from_run_metadata factory method.
Inside analysis.py:
import argparse
from repx_py import Experiment
parser = argparse.ArgumentParser()
parser.add_argument("--meta", help="Path to input run metadata")
parser.add_argument("--store", help="Path to artifact store base")
args = parser.parse_args()
# Load context from the specific upstream run
exp = Experiment.from_run_metadata(args.meta, args.store)
# Now you can query the upstream jobs as usual
jobs = exp.jobs()
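Because both paths are needed before the experiment can be loaded, it may be worth failing early when one is missing. The sketch below is one possible hardening of the argument parsing using only argparse and os.path; the parse_args helper and the existence check are additions for illustration and are not part of repx-py.

```python
import argparse
import os
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse --meta and --store, exiting with a usage error if either is missing or absent on disk."""
    parser = argparse.ArgumentParser(description="Analysis stage entry point")
    parser.add_argument("--meta", required=True, help="Path to input run metadata")
    parser.add_argument("--store", required=True, help="Path to artifact store base")
    args = parser.parse_args(argv)

    for flag, value in (("--meta", args.meta), ("--store", args.store)):
        if not os.path.exists(value):
            # parser.error prints the usage message and exits with status 2.
            parser.error(f"{flag} path does not exist: {value}")
    return args
```

The returned values are plain strings, so they can be handed to Experiment.from_run_metadata(args.meta, args.store) exactly as in the script above.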