ci/official/README.md
# Official CI Directory
Maintainer: TensorFlow and TensorFlow DevInfra
Issue Reporting: File an issue against this repo and tag
[@devinfra](https://github.com/orgs/tensorflow/teams/devinfra)
********************************************************************************
## TensorFlow's Official CI and Build/Test Scripts
TensorFlow's official CI jobs run the scripts in this folder. Our internal CI
system, Kokoro, schedules our CI jobs by combining a build script with a file
from the `envs` directory that is filled with configuration options:
- Nightly jobs (Run nightly on the `nightly` branch)
  - Uses `wheel.sh`, `libtensorflow.sh`, `code_check_full.sh`
- Continuous jobs (Run on every GitHub commit)
  - Uses `pycpp.sh`
- Presubmit jobs (Run on every GitHub PR)
  - Uses `pycpp.sh`, `code_check_changed_files.sh`
These "env" files match up with an environment matrix that roughly covers:
- Different Python versions
- Linux, MacOS, and Windows machines (these pool definitions are internal)
- x86 and arm64
- CPU-only, or with NVIDIA CUDA support (Linux only), or with TPUs
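
For a concrete picture, each env file is simply a set of bash variable
assignments that the build scripts later read. The file name and variables
below are hypothetical stand-ins for illustration only; see the real files
under `ci/official/envs/` for the actual names:

```shell
# envs/example (illustrative only, not a real file in this repo)
TFCI_PYTHON_VERSION=3.12
TFCI_DOCKER_ENABLE=1
```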
## How to Test Your Changes to TensorFlow
You may check how your changes will affect TensorFlow by:
1. Creating a PR and observing the presubmit test results
2. Running the CI scripts locally, as explained below
3. **Google employees only**: using an internal-only tool called "MLCI" that
   makes testing more convenient: it can execute any full CI job against a
   pending change. Search for "MLCI" internally to find it.
You may invoke a CI script of your choice by following these instructions:
```bash
cd tensorflow-git-dir
# Here is a single-line example of running a script on Linux to build the
# GPU version of TensorFlow for Python 3.12, using the public TF bazel cache and
# a local build cache:
TFCI=py312,linux_x86_cuda,public_cache,disk_cache ci/official/wheel.sh
# First, set your TFCI variable to choose the environment settings.
# TFCI is a comma-separated list of filenames from the envs directory, which
# are all settings for the scripts. TF's CI jobs are all made of a combination
# of these env files.
#
# If you've opened a test result from our CI (via a dashboard or GitHub link),
# open "Invocation Details" and find BUILD_CONFIG: its "env_vars" list contains
# a TFCI value that you can copy to reproduce that environment.
# Ex. 1: TFCI=py311,linux_x86_cuda,nightly_upload (nightly job)
# Ex. 2: TFCI=py39,linux_x86,rbe (continuous job)
# Non-Googlers should replace "nightly_upload" or "rbe" with
# "public_cache,disk_cache".
# Googlers should replace "nightly_upload" with "public_cache,disk_cache", or
# with "rbe" if you have set up your system to use RBE (see further below).
#
# Here is how to choose your TFCI value:
# 1. A Python version must come first, because other scripts reference it.
# Ex. py39 -- Python 3.9
# Ex. py310 -- Python 3.10
# Ex. py311 -- Python 3.11
# Ex. py312 -- Python 3.12
# 2. Choose the platform, which corresponds to the version of TensorFlow to
# build. This should also match the system you're using--you cannot build
# the TF MacOS package from Linux.
# Ex. linux_x86 -- x86_64 Linux platform
# Ex. linux_x86_cuda -- x86_64 Linux platform, with Nvidia CUDA support
# Ex. macos_arm64 -- arm64 MacOS platform
# 3. Add modifiers. Some modifiers for local execution are:
# Ex. disk_cache -- Use a local cache
# Ex. public_cache -- Use TF's public cache (read-only)
# Ex. public_cache_push -- Use TF's public cache (read and write, Googlers only)
# Ex. rbe -- Use RBE for faster builds (Googlers only; see below)
# Ex. no_docker -- Disable docker on enabled platforms
# See full examples below for more details on these. Some other modifiers are:
# Ex. versions_upload -- for TF official release versions
# Ex. nightly_upload -- for TF nightly official builds; changes version numbers
# Ex. no_upload -- Disable all uploads, usually for temporary CI issues
# Recommended: use a local+remote cache.
#
# Bazel will cache your builds in tensorflow/build_output/cache,
# and will also try using public build cache results to speed up
# your builds. This usually saves a lot of time, especially when
# re-running tests. However, note that:
#
# - New environments like new CUDA versions, changes to manylinux,
# compilers, etc. can cause undefined behavior such as build failures
# or tests passing incorrectly.
# - Automatic LLVM updates are known to extend build time even with
# the cache; this is unavoidable.
export TFCI=py311,linux_x86,public_cache,disk_cache
# Recommended: Configure Docker. (Linux only)
#
# TF uses hub.docker.com/r/tensorflow/build containers for CI,
# and scripts on Linux create a persistent container called "tf"
# which mounts your TensorFlow directory into the container.
#
# Important: because the container is persistent, you cannot change TFCI
# variables in between script executions. To forcibly remove the
# container and start fresh, run "docker rm -f tf". Removing the container
# destroys some temporary bazel data and causes longer builds.
#
# You will need the NVIDIA Container Toolkit for GPU testing:
# https://github.com/NVIDIA/nvidia-container-toolkit
#
# Note: if you interrupt a bazel command on docker (ctrl-c), you
# will need to run `docker exec tf pkill bazel` to quit bazel.
#
# Note: new files created from the container are owned by "root".
# You can run e.g. `docker exec tf chown -R $(id -u):$(id -g) build_output`
# to transfer ownership to your user.
#
# Docker is enabled by default on Linux. You may disable it if you prefer:
# export TFCI=py311,linux_x86,no_docker
# Advanced: Use Remote Build Execution (RBE) (internal developers only)
#
# RBE dramatically speeds up builds and testing. It also gives you a
# public URL to share your build results with collaborators. However,
# it is only available to a limited set of internal TensorFlow developers.
#
# RBE is incompatible with local caching, so you must remove
# disk_cache, public_cache, and public_cache_push from your TFCI value.
#
# To use RBE, you must first run `gcloud auth application-default login`, then:
export TFCI=py311,linux_x86,rbe
# Finally: Run your script of choice.
# If you've clicked on a test result from our CI (via a dashboard or GitHub link),
# click to "Invocation Details" and find BUILD_CONFIG, which will contain a
# "build_file" item that indicates the script used.
ci/official/wheel.sh
# Advanced: Select specific build/test targets with "any.sh".
# TF_ANY_TARGETS=":your/target" TF_ANY_MODE="test" ci/official/any.sh
# Afterwards: Examine the results, which include the bazel cache, generated
# artifacts like .whl files, and "script.log" from the script.
# Note that files created under Docker will be owned by "root".
ls build_output
```
## Contribution & Maintenance
The TensorFlow team does not yet have guidelines in place for contributing to
this directory. We are working on it. Please join a TF SIG Build meeting (see:
bit.ly/tf-sig-build-notes) if you'd like to discuss the future of contributions.
### Brief System Overview
The top-level scripts and utility scripts should be fairly well-documented. Here
is a brief explanation of how they tie together:
1. `envs/*` are lists of variables made with bash syntax. A user must set a
   `TFCI` env param pointing to a list of `env` files.
2. `utilities/setup.sh`, initialized by all top-level scripts, reads and sets
   values from those `TFCI` paths.
   - `set -a` / `set -o allexport` exports the variables from `env` files so
     all scripts can use them.
   - `utilities/setup_docker.sh` creates a container called `tf` with all
     `TFCI_` variables shared to it.
3. Top-level scripts (`wheel.sh`, etc.) reference `env` variables and call
   `utilities/` scripts.
   - The `tfrun` function makes a command run correctly in Docker if Docker
     is enabled.
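
The overview above can be sketched in a few lines of bash. This is an
illustrative approximation only, not the actual contents of
`utilities/setup.sh`: the env file names and `TFCI_` variable values here are
hypothetical stand-ins, and the real scripts do considerably more.

```shell
#!/usr/bin/env bash
# Sketch of the TFCI flow: source each env file named in TFCI with
# allexport, then run commands through a tfrun-style wrapper.

# Stand-in "envs" directory with two hypothetical env files.
envs="$(mktemp -d)"
echo 'TFCI_PYTHON_VERSION=3.11' > "$envs/py311"
echo 'TFCI_DOCKER_ENABLE=0'     > "$envs/no_docker"

TFCI="py311,no_docker"

# Step 2: source each comma-separated TFCI entry; allexport makes every
# assignment an exported variable, visible to all child scripts.
set -o allexport
for name in ${TFCI//,/ }; do
  source "$envs/$name"
done
set +o allexport

# Step 3: a tfrun-like wrapper that runs a command inside the persistent
# "tf" container when Docker is enabled, or directly on the host otherwise.
tfrun() {
  if [[ "${TFCI_DOCKER_ENABLE:-0}" == "1" ]]; then
    docker exec tf "$@"
  else
    "$@"
  fi
}

tfrun echo "python=${TFCI_PYTHON_VERSION}"
```

With `TFCI_DOCKER_ENABLE=0`, `tfrun` falls through to running the command on
the host, which is why this sketch works without Docker installed.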