ci/official/README.md
# Official CI Directory
Maintainer: TensorFlow and TensorFlow DevInfra
Issue Reporting: File an issue against this repo and tag
[@devinfra](https://github.com/orgs/tensorflow/teams/devinfra)
********************************************************************************
## TensorFlow's Official CI and Build/Test Scripts
TensorFlow's official CI jobs run the scripts in this folder. Our internal CI
system, Kokoro, schedules our CI jobs by combining a build script with a file
from the `envs` directory that is filled with configuration options:
- Nightly jobs (Run nightly on the `nightly` branch)
  - Uses `wheel.sh`, `libtensorflow.sh`, `code_check_full.sh`
- Continuous jobs (Run on every GitHub commit)
  - Uses `pycpp.sh`
- Presubmit jobs (Run on every GitHub PR)
  - Uses `pycpp.sh`, `code_check_changed_files.sh`
These "env" files match up with an environment matrix that roughly covers:
- Different Python versions
- Linux, MacOS, and Windows machines (these pool definitions are internal)
- x86 and arm64
- CPU-only, or with NVIDIA CUDA support (Linux only), or with TPUs
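
For a concrete picture, each env file is simply a set of bash variable
assignments that the build scripts later read. The file name and variables
below are hypothetical stand-ins for illustration only; see the real files
under `ci/official/envs/` for the actual names:

```shell
# envs/example (illustrative only, not a real file in this repo)
TFCI_PYTHON_VERSION=3.12
TFCI_DOCKER_ENABLE=1
```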
## How to Test Your Changes to TensorFlow
You may check how your changes will affect TensorFlow by:
1. Creating a PR and observing the presubmit test results
2. Running the CI scripts locally, as explained below
3. **Google employees only**: using an internal-only tool called "MLCI" that
   makes testing more convenient: it can execute any full CI job against a
   pending change. Search for "MLCI" internally to find it.
You may invoke a CI script of your choice by following these instructions:
```bash
cd tensorflow-git-dir
# Here is a single-line example of running a script on Linux to build the
# GPU version of TensorFlow for Python 3.12, using the public TF bazel cache and
# a local build cache:
TFCI=py312,linux_x86_cuda,public_cache,disk_cache ci/official/wheel.sh
# First, set your TFCI variable to choose the environment settings.
# TFCI is a comma-separated list of filenames from the envs directory, which
# are all settings for the scripts. TF's CI jobs are all made of a combination
# of these env files.
#
# If you've opened a test result from our CI (via a dashboard or GitHub link),
# open "Invocation Details" and find BUILD_CONFIG: its "env_vars" list contains
# a TFCI value that you can copy to reproduce that environment.
# Ex. 1: TFCI=py311,linux_x86_cuda,nightly_upload (nightly job)
# Ex. 2: TFCI=py39,linux_x86,rbe (continuous job)
# Non-Googlers should replace "nightly_upload" or "rbe" with
# "public_cache,disk_cache".
# Googlers should replace "nightly_upload" with "public_cache,disk_cache", or
# with "rbe" if you have set up your system to use RBE (see further below).
#
# Here is how to choose your TFCI value:
# 1. A Python version must come first, because other scripts reference it.
# Ex. py39 -- Python 3.9
# Ex. py310 -- Python 3.10
# Ex. py311 -- Python 3.11
# Ex. py312 -- Python 3.12
# 2. Choose the platform, which corresponds to the version of TensorFlow to
# build. This should also match the system you're using--you cannot build
# the TF MacOS package from Linux.
# Ex. linux_x86 -- x86_64 Linux platform
# Ex. linux_x86_cuda -- x86_64 Linux platform, with Nvidia CUDA support
# Ex. macos_arm64 -- arm64 MacOS platform
# 3. Add modifiers. Some modifiers for local execution are:
# Ex. disk_cache -- Use a local cache
# Ex. public_cache -- Use TF's public cache (read-only)
# Ex. public_cache_push -- Use TF's public cache (read and write, Googlers only)
# Ex. rbe -- Use RBE for faster builds (Googlers only; see below)
# Ex. no_docker -- Disable docker on enabled platforms
# See full examples below for more details on these. Some other modifiers are:
# Ex. versions_upload -- for TF official release versions
# Ex. nightly_upload -- for TF nightly official builds; changes version numbers
# Ex. no_upload -- Disable all uploads, usually for temporary CI issues
# Recommended: use a local+remote cache.
#
# Bazel will cache your builds in tensorflow/build_output/cache,
# and will also try using public build cache results to speed up
# your builds. This usually saves a lot of time, especially when
# re-running tests. However, note that:
#
# - New environments like new CUDA versions, changes to manylinux,
# compilers, etc. can cause undefined behavior such as build failures
# or tests passing incorrectly.
# - Automatic LLVM updates are known to extend build time even with
# the cache; this is unavoidable.
export TFCI=py311,linux_x86,public_cache,disk_cache
# Recommended: Configure Docker. (Linux only)
#
# TF uses hub.docker.com/r/tensorflow/build containers for CI,
# and scripts on Linux create a persistent container called "tf"
# which mounts your TensorFlow directory into the container.
#
# Important: because the container is persistent, you cannot change TFCI
# variables in between script executions. To forcibly remove the
# container and start fresh, run "docker rm -f tf". Removing the container
# destroys some temporary bazel data and causes longer builds.
#
# You will need the NVIDIA Container Toolkit for GPU testing:
# https://github.com/NVIDIA/nvidia-container-toolkit
#
# Note: if you interrupt a bazel command on docker (ctrl-c), you
# will need to run `docker exec tf pkill bazel` to quit bazel.
#
# Note: new files created from the container are owned by "root".
# You can run e.g. `docker exec tf chown -R $(id -u):$(id -g) build_output`
# to transfer ownership to your user.
#
# Docker is enabled by default on Linux. You may disable it if you prefer:
# export TFCI=py311,linux_x86,no_docker
# Advanced: Use Remote Build Execution (RBE) (internal developers only)
#
# RBE dramatically speeds up builds and testing. It also gives you a
# public URL to share your build results with collaborators. However,
# it is only available to a limited set of internal TensorFlow developers.
#
# RBE is incompatible with local caching, so you must remove
# disk_cache, public_cache, and public_cache_push from your TFCI value.
#
# To use RBE, you must first run `gcloud auth application-default login`, then:
export TFCI=py311,linux_x86,rbe
# Finally: Run your script of choice.
# If you've clicked on a test result from our CI (via a dashboard or GitHub link),
# click to "Invocation Details" and find BUILD_CONFIG, which will contain a
# "build_file" item that indicates the script used.
ci/official/wheel.sh
# Advanced: Select specific build/test targets with "any.sh".
# TF_ANY_TARGETS=":your/target" TF_ANY_MODE="test" ci/official/any.sh
# Afterwards: Examine the results, which include the bazel cache, generated
# artifacts like .whl files, and "script.log" from the script.
# Note that files created under Docker will be owned by "root".
ls build_output
```
## Contribution & Maintenance
The TensorFlow team does not yet have guidelines in place for contributing to
this directory. We are working on it. Please join a TF SIG Build meeting (see:
bit.ly/tf-sig-build-notes) if you'd like to discuss the future of contributions.
### Brief System Overview
The top-level scripts and utility scripts should be fairly well-documented. Here
is a brief explanation of how they tie together:
1. `envs/*` are lists of variables made with bash syntax. A user must set a
   `TFCI` env param pointing to a list of `env` files.
2. `utilities/setup.sh`, initialized by all top-level scripts, reads and sets
   values from those `TFCI` paths.
   - `set -a` / `set -o allexport` exports the variables from `env` files so
     all scripts can use them.
   - `utilities/setup_docker.sh` creates a container called `tf` with all
     `TFCI_` variables shared to it.
3. Top-level scripts (`wheel.sh`, etc.) reference `env` variables and call
   `utilities/` scripts.
   - The `tfrun` function makes a command run correctly in Docker if Docker
     is enabled.
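
The overview above can be sketched in a few lines of bash. This is an
illustrative approximation only, not the actual contents of
`utilities/setup.sh`: the env file names and `TFCI_` variable values here are
hypothetical stand-ins, and the real scripts do considerably more.

```shell
#!/usr/bin/env bash
# Sketch of the TFCI flow: source each env file named in TFCI with
# allexport, then run commands through a tfrun-style wrapper.

# Stand-in "envs" directory with two hypothetical env files.
envs="$(mktemp -d)"
echo 'TFCI_PYTHON_VERSION=3.11' > "$envs/py311"
echo 'TFCI_DOCKER_ENABLE=0'     > "$envs/no_docker"

TFCI="py311,no_docker"

# Step 2: source each comma-separated TFCI entry; allexport makes every
# assignment an exported variable, visible to all child scripts.
set -o allexport
for name in ${TFCI//,/ }; do
  source "$envs/$name"
done
set +o allexport

# Step 3: a tfrun-like wrapper that runs a command inside the persistent
# "tf" container when Docker is enabled, or directly on the host otherwise.
tfrun() {
  if [[ "${TFCI_DOCKER_ENABLE:-0}" == "1" ]]; then
    docker exec tf "$@"
  else
    "$@"
  fi
}

tfrun echo "python=${TFCI_PYTHON_VERSION}"
```

With `TFCI_DOCKER_ENABLE=0`, `tfrun` falls through to running the command on
the host, which is why this sketch works without Docker installed.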