Performance & Scalability

Strategies for running Copybara efficiently on large codebases.

Large repositories require more memory. Set the heap size with -Xmx:

# Default (may be insufficient for large repos)
java -jar copybara.jar migrate copy.bara.sky
# 4GB heap for medium repos (10k-50k files)
java -Xmx4g -jar copybara.jar migrate copy.bara.sky
# 8GB heap for large repos (50k-200k files)
java -Xmx8g -jar copybara.jar migrate copy.bara.sky
# 16GB+ for very large repos (200k+ files)
java -Xmx16g -jar copybara.jar migrate copy.bara.sky

Signs that the heap is too small:

  • OutOfMemoryError: Java heap space
  • Process killed by the OOM killer (check dmesg)
  • Unexplained slowdowns during the transformation phase
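
The file-count tiers in the commands above can be turned into a small helper. The following is a minimal Python sketch, not part of Copybara: `suggest_heap` and `count_files` are hypothetical names, and the thresholds simply mirror the rules of thumb listed above.

```python
import os

# Thresholds mirror the tiers listed above; they are rules of thumb,
# not limits enforced by Copybara.
HEAP_TIERS = [
    (200_000, "16g"),
    (50_000, "8g"),
    (10_000, "4g"),
]


def suggest_heap(file_count):
    """Return a -Xmx flag for the file count, or None to keep the JVM default."""
    for threshold, size in HEAP_TIERS:
        if file_count >= threshold:
            return f"-Xmx{size}"
    return None


def count_files(root):
    """Count regular files under root, skipping the .git directory."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        total += len(filenames)
    return total
```

Calling `suggest_heap(count_files("."))` from the repository root gives a starting point. Treat the result as a first guess and raise it if the symptoms above persist.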

File matching is often the biggest performance factor.

# SLOW: Scans everything, then filters
origin_files = glob(
    include = ["**"],
    exclude = ["vendor/**", "node_modules/**", "**/*.generated.*"],
)
# FAST: Only scans needed directories
origin_files = glob(["src/**", "lib/**", "docs/**"])

Copybara calculates “roots” from glob patterns to determine which directories to query:

| Pattern       | Root Queried    |
|---------------|-----------------|
| src/**/*.py   | src/            |
| pkg/api/**    | pkg/api/        |
| **/*.java     | / (entire repo) |
| *.md          | / (root only)   |

Patterns starting with ** force full repository traversal.
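
As a rough illustration of how a root falls out of a pattern, the sketch below keeps the leading directory segments that contain no wildcard. It is a simplified Python approximation written for this page, not Copybara's actual implementation; `glob_root` and `glob_roots` are hypothetical names.

```python
WILDCARD_CHARS = set("*?[{")


def glob_root(pattern):
    """Approximate the directory a pattern is rooted at.

    Keeps leading directory segments until one contains a wildcard.
    The final segment is treated as a file name and never joins the root.
    Returns "" for the repository root.
    """
    segments = pattern.split("/")
    root = []
    for segment in segments[:-1]:
        if WILDCARD_CHARS & set(segment):
            break
        root.append(segment)
    return "/".join(root)


def glob_roots(patterns):
    """Collect roots for several patterns; a parent root absorbs its children."""
    roots = sorted({glob_root(p) for p in patterns})
    kept = []
    for r in roots:
        # Skip r when an already-kept root is the repo root or an ancestor of r.
        if any(k == "" or r == k or r.startswith(k + "/") for k in kept):
            continue
        kept.append(r)
    return kept
```

For example, `glob_root("src/**/*.py")` gives `src`, while `glob_root("**/*.java")` gives the empty string, meaning the repository root. A single `**`-prefixed pattern in a list is enough for `glob_roots` to collapse everything to the root.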

When a broad include is unavoidable, exclude the directories you don’t need:

origin_files = glob(
    include = ["**"],
    exclude = [
        "node_modules/**",
        "vendor/**",
        ".git/**",
        "build/**",
        "dist/**",
        "target/**",
        "**/*.min.js",
        "**/*.map",
    ],
)

For CI environments, use shallow clones when you only need recent history:

# Clone with limited depth
git clone --depth=100 https://github.com/org/repo
# Then run Copybara with explicit last-rev
java -jar copybara.jar migrate copy.bara.sky --last-rev HEAD~50

For very large monorepos, use Git’s partial clone:

# Clone without blobs initially
git clone --filter=blob:none https://github.com/org/repo
# Configure sparse checkout
git sparse-checkout init --cone
git sparse-checkout set src/component docs/

When running multiple Copybara workflows against the same origin, use reference repos:

# Create a shared reference
git clone --bare https://github.com/org/repo /shared/repo.git
# Copybara workflows reference it (reduces network I/O)
git clone --reference /shared/repo.git https://github.com/org/repo

Don’t process entire history on every run:

# Only process last 100 commits
java -jar copybara.jar migrate copy.bara.sky --last-rev HEAD~100
# Start from specific known-good commit
java -jar copybara.jar migrate copy.bara.sky --last-rev abc123def

When importing many commits, SQUASH mode is faster than ITERATIVE:

core.workflow(
    name = "import",
    mode = "SQUASH",  # Combines all changes into one commit
    # ...
)

ITERATIVE mode processes each commit separately, which is slower but preserves history.

Group related transformations to minimize file I/O:

# Less efficient: Multiple passes over files
transformations = [
    core.replace(before = "old1", after = "new1", paths = glob(["**/*.java"])),
    core.replace(before = "old2", after = "new2", paths = glob(["**/*.java"])),
    core.replace(before = "old3", after = "new3", paths = glob(["**/*.java"])),
]
# More efficient: Combined patterns where possible.
# The regex group must be referenced in `before` to take effect.
transformations = [
    core.replace(
        before = "com.old.${package}",
        after = "com.new.${package}",
        paths = glob(["**/*.java"]),
        regex_groups = {"package": ".*"},
    ),
]
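
To see why a single grouped pattern can stand in for several literal replacements, here is a plain-Python analogue of the `old1`/`old2`/`old3` case. It uses Python's `re` module rather than Copybara, so it only illustrates the idea; `three_passes` and `one_pass` are hypothetical names.

```python
import re


def three_passes(text):
    """Three literal replacements, each scanning the whole text."""
    for n in ("1", "2", "3"):
        text = text.replace("old" + n, "new" + n)
    return text


# One compiled pattern with a capture group covers all three cases.
COMBINED = re.compile(r"old([1-3])")


def one_pass(text):
    """A single scan that rewrites old1, old2, and old3 together."""
    return COMBINED.sub(r"new\1", text)
```

Both functions produce the same result, but `one_pass` reads the input once. In Copybara terms, the capture group plays the role of an entry in `regex_groups`.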

Cache the Copybara JAR and Git repos between runs:

# GitHub Actions example
- uses: actions/cache@v4
  with:
    path: |
      ~/.copybara
      ~/.m2/repository
    key: copybara-${{ hashFiles('copy.bara.sky') }}

Run independent workflows in parallel:

jobs:
  sync-component-a:
    runs-on: ubuntu-latest
    steps:
      - run: java -jar copybara.jar migrate copy.bara.sky component-a
  sync-component-b:
    runs-on: ubuntu-latest
    steps:
      - run: java -jar copybara.jar migrate copy.bara.sky component-b
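
Outside a CI system, the same fan-out can be approximated with a small launcher script. The sketch below assumes the workflow names and file names used in the jobs above and runs one `java` process per workflow. The `runner` parameter is a hypothetical hook for swapping out `subprocess.run`, for example to do a dry run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(workflow, config="copy.bara.sky", jar="copybara.jar"):
    """Build the same command line used in the CI jobs above."""
    return ["java", "-jar", jar, "migrate", config, workflow]


def run_workflow(workflow, runner=subprocess.run):
    """Run one workflow and return its name with the process exit code."""
    result = runner(build_command(workflow))
    return workflow, result.returncode


def run_all(workflows, runner=subprocess.run, max_workers=2):
    """Run independent workflows concurrently; returns {name: exit_code}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda w: run_workflow(w, runner), workflows))
```

Threads are sufficient here because each worker only waits on an external process. Remember that every JVM gets its own heap, so memory use scales with `max_workers`.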

Only run Copybara when relevant files change:

on:
  push:
    paths:
      - "src/**"
      - "docs/**"
      - "copy.bara.sky"

Enable detailed logging for performance analysis:

java -jar copybara.jar migrate copy.bara.sky -v

The verbose output includes timing for each phase:

  • Git operations (clone, fetch, push)
  • File matching
  • Transformation execution
  • Commit creation

For persistent performance issues, enable JVM profiling:

# On JDK 11+, -XX:StartFlightRecording is sufficient on its own
java -XX:StartFlightRecording=duration=300s,filename=copybara.jfr \
  -jar copybara.jar migrate copy.bara.sky

Analyze the recording with the jfr command-line tool or Java Mission Control.

Split large repos into independent workflows:

# workflow-frontend.bara.sky
core.workflow(
    name = "frontend",
    origin_files = glob(["frontend/**"]),
    # ...
)
# workflow-backend.bara.sky
core.workflow(
    name = "backend",
    origin_files = glob(["backend/**"]),
    # ...
)

For high-frequency syncs, implement rate limiting:

# Only sync if last sync was > 5 minutes ago
if [ -n "$(find /tmp/last-sync -mmin -5 2>/dev/null)" ]; then
  echo "Skipping: synced recently"
  exit 0
fi
touch /tmp/last-sync
java -jar copybara.jar migrate copy.bara.sky
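
If the sync is driven from a script rather than a shell one-liner, the same stamp-file check might look like the Python sketch below. `should_sync` and `mark_synced` are hypothetical helper names, and the 300-second default mirrors the five-minute window above.

```python
import os
import time


def should_sync(stamp_path, min_interval_s=300, now=None):
    """Return True when the stamp file is missing or older than min_interval_s."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_path)
    except FileNotFoundError:
        # No previous sync recorded, so a sync is allowed.
        return True
    return (now - last) >= min_interval_s


def mark_synced(stamp_path):
    """Create the stamp file if needed and refresh its modification time."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)
```

A caller would check `should_sync("/tmp/last-sync")`, call `mark_synced` on the same path, and only then launch Copybara, matching the order of the shell version.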

Typical performance on modern hardware (8-core, 32GB RAM):

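| Repo Size  | Files    | First Sync | Incremental |
|------------|----------|------------|-------------|
| Small      | 1-1,000  | ~30s       | ~5s         |
| Medium     | 1k-10k   | ~2m        | ~15s        |
| Large      | 10k-100k | ~10m       | ~1m         |
| Very Large | 100k+    | ~30m+      | ~5m         |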