Test Load Balancer (TLB)





This documentation directory covers core concepts and all the configuration options for TLB.

Concepts in TLB

Components

TLB has two primary components:
  • A Server: stores and allows querying of test data (test times, test results, etc.)
  • A Balancer: partitions and re-orders a suite of tests, given a server URL
The Server is a repository of test run data. It stores historical test run data (test times and test results, as of now) that is posted by balancer(s), and allows balancer(s) to pull the same for future builds, to figure out which tests to run on a partition and in what order to run them, as shown in Figure 1.

However, the word Server here must not be confused with a Build/Continuous Integration (CI) server, which usually schedules builds, assigns tasks to different machines from the worker-machine grid, and so on. All of that is the responsibility of a Build/CI server. The Server in the context of TLB is a very simple data white-board that Balancers use only to download and upload test data when the test-target of the build executes.

TLB supports two server configurations out of the box. See the server implementations for more details.

The Balancer is where the real action is. The Balancer can be set up in the build script such that it gets a callback before the test-runner assumes control, and has a listener registered with the given Test Runner. Please check out Example Configurations to understand how this hookup manifests in terms of build-file configurations (it's only a few lines).

Just before the tests start running, the Balancer gets a callback from the build-tool with the list of all the tests that are to be executed. It prunes this list, reorders it (if need be), and passes that smaller subset of tests on to the Test Runner. When test execution finishes, test-runners typically notify listeners. Using this notification mechanism, the Balancer records the data it cares about and posts this data back to the server.

For instance, the Ant JUnit support uses Ant FileSets (Build) and the JUnit task (Test Runner). This task obtains historical test data from the TLB Server and uses that data to split and order tests. The pruned list is then handed over to JUnit. JUnit in turn provides a callback on test completion with the time the test took (test-time) and the outcome of the test (test-result). This data is then posted back to the Server, where it is stored to be downloaded during future runs.

The Balancer has two aspects to it: the Splitter and the Orderer. Check out the Balancer Components to understand them in greater depth. We strongly recommend reading about the Balancer components, as they are core to what TLB does. A sketch of what splitting can look like follows.
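
To make the Splitter concrete, here is a minimal sketch of one possible count-based split, assuming every partition receives the identical full test list and knows its own partition number and the total partition count. The class and method names are illustrative, not TLB's actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of a count-based split: every partition runs the same
    // deterministic algorithm on the same full list, so the resulting subsets are
    // mutually exclusive and collectively exhaustive.
    class CountBasedSplitSketch {
        // partitionNumber is 1-based; totalPartitions is the number of partitions.
        static List<String> subsetFor(List<String> allTests, int partitionNumber, int totalPartitions) {
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < allTests.size(); i++) {
                if (i % totalPartitions == partitionNumber - 1) {
                    subset.add(allTests.get(i));
                }
            }
            return subset;
        }
    }

Because the algorithm is deterministic and the input list is the same everywhere, no partition needs to coordinate with any other to know which tests it owns.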


Apart from these two components, there is a third concept in TLB called a Job.

Job

A Job represents work that needs to be parallelized using TLB. For example, executing all the unit tests of a project "Foo" is a job, as it is work that can be parallelized.

In TLB, a Job has two attributes:

  • Job Name

    This uniquely identifies a job in TLB. TLB uses this to store all test data for a given job. In the above example, "Foo-unit-tests" could be set as the name of the job. TLB then stores all the test run times, failure statuses, etc. for the task of running the unit tests in parallel under the name "Foo-unit-tests".

  • Job Version

    Typically, more than one instance of the same job can be running simultaneously, and TLB needs to give the right partitions to the right instances. For example, builds 10 and 11 can be running simultaneously with two partitions each. In this case TLB should not end up using the data posted by build 11 (if it finished faster) to compute partition information for build 10. Nor should TLB use the information posted by partition 1 of build 10 in partition 2 of build 10. Consider the following example:

    Let's say we have three partitions: A, B and C. A may have started running tests and may already have reported results and times for a few tests by the time B and C start. Now, let's say B and C want to time-balance and hence want data from the server. However, B and C must balance based on the exact same data that A started out with, and not the updated data, which has feedback from A. If new time data from A were used in balancing on B, B might end up re-running a test that A has already run (because it now appears faster). Avoiding this is vital for the mutual-exclusion and collective-exhaustion principle that TLB follows.

    To solve this problem, TLB has a concept of versioning. When A starts running, it posts a version string to the TLB server, against which the server stores a snapshot of the data that is relevant for the corresponding TLB_JOB_NAME. When B or C queries data using the same version, they get the same data that A got. This ensures that all partitions see the same data, in spite of the server receiving new data continuously.

    Usually TLB_JOB_VERSION is set such that it changes between suite-runs. For instance, the build number can be used as TLB_JOB_VERSION. In this case, A, B and C may all be running at version 10.

    Using a new version for each run ensures that the frozen (hence stale) data is not used for balancing/ordering the next run of the same test suite. When the next build is triggered, all three partitions start with the corresponding build number, which may be 11, hence the frozen snapshot of data from version 10 is not used.

    As can be seen, job versions are important. We advise that every Job have an associated job version. Job versioning may be intimidating to understand initially. However, as you start using TLB regularly, you will come to appreciate the need for it; a small sketch below illustrates the snapshot behaviour.

You can refer to the Configuration Variables to see how TLB_JOB_NAME and TLB_JOB_VERSION are configured.
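
To make the snapshot behaviour concrete, here is a minimal sketch of how a server might freeze data per (job name, job version) pair. The types and names are illustrative, not TLB's actual implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: the first query against a (job name, job version)
    // pair freezes a snapshot of the live data; later queries with the same
    // version see exactly that snapshot, even as new data keeps arriving.
    class VersionedSnapshotSketch {
        private final Map<String, Map<String, Long>> liveTestTimes = new HashMap<>();
        private final Map<String, Map<String, Long>> snapshots = new HashMap<>();

        // Called by a balancer partition, e.g. testTimesFor("Foo-unit-tests", "10").
        Map<String, Long> testTimesFor(String jobName, String jobVersion) {
            String key = jobName + "|" + jobVersion;
            // computeIfAbsent freezes the snapshot on first access for this version.
            return snapshots.computeIfAbsent(key,
                    k -> new HashMap<>(liveTestTimes.getOrDefault(jobName, new HashMap<>())));
        }

        // Called as partitions post fresh run feedback; does not disturb snapshots.
        void recordTestTime(String jobName, String testName, long millis) {
            liveTestTimes.computeIfAbsent(jobName, k -> new HashMap<>()).put(testName, millis);
        }
    }

With this behaviour, B and C querying with version "10" get exactly what A got, while feedback posted during the run only influences snapshots taken for later versions.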


Interaction between the Server and the Balancer while TLBing

Figure 1: Interaction between the Server and the Balancer, showing where each fits in the overall act of load balancing.


Typical TLB setup

Figure 2: Typical TLB setup for a JVM based project

The diagram above shows what a typical TLB setup looks like.

  1. TLB Server: Already covered in previous sections.
  2. TLB Balancer: Already covered in previous sections.
  3. Test Runner: This is the runner that actually runs the tests. TLB does not take on the responsibility of running tests; that is still handled by the underlying framework, i.e. the Test Runner. Examples: JUnit, RSpec, Test::Unit, NUnit, etc.
  4. Partitions: These are the parallel machines/VMs/processes/threads that run the same build task. With TLB hooked in, each of these partitions executes a mutually exclusive set of tests. A partition is also responsible for kicking off the build so that the test framework and TLB are started up. Typically, in a CI environment, the machines that execute the test task(s) (and in turn the partitions) are agents from the build grid or build farm of a CI/build server.
  5. Server - Balancer communication: The Balancer posts test-related data about the current test run to the server, and obtains historical data when it is trying to balance and re-order.
  6. Balancer - Test Runner communication: Before a test runner starts executing tests, TLB gets a callback with the original list (which includes all the tests). This will be the same across all the partitions. TLB executes the same partitioning algorithm in each partition, selects the subset corresponding to the given partition number, and passes that subset on to the test-runner. Thus, the test-runner (in each such partition) ends up executing a smaller set of tests.

    Most test runners provide a mechanism for hooking up listeners that are notified about test execution status/life-cycle. Using this, TLB gets information about test run-times and outcomes (results). This is what gets posted back to the TLB Server; a sketch of such a listener follows.
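
As one concrete possibility for the JVM side, a JUnit 4 RunListener could capture per-test times and outcomes along these lines. This is a sketch only, and postToServer is a hypothetical stand-in for the actual reporting call, not TLB's real listener.

    import org.junit.runner.Description;
    import org.junit.runner.notification.Failure;
    import org.junit.runner.notification.RunListener;

    // Sketch of a JUnit 4 listener capturing what a balancer cares about:
    // how long each test took (test-time) and whether it passed (test-result).
    public class RecordingListenerSketch extends RunListener {
        private long startedAt;
        private boolean failed;

        @Override
        public void testStarted(Description description) {
            startedAt = System.currentTimeMillis();
            failed = false;
        }

        @Override
        public void testFailure(Failure failure) {
            failed = true;
        }

        @Override
        public void testFinished(Description description) {
            long elapsed = System.currentTimeMillis() - startedAt;
            postToServer(description.getClassName(), elapsed, failed); // hypothetical reporting call
        }

        private void postToServer(String testClass, long millis, boolean failed) {
            // A real listener would POST this to the TLB server; logging stands in here.
            System.out.println(testClass + " took " + millis + "ms, failed=" + failed);
        }
    }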

TLB works a bit differently in an alien (non-JVM) environment. Please check Typical TLB setup in Alien Environment for details.


Balanced test-execution events on time-line

The animation below gives a broad overview of the major events that happen from invocation to termination of a balanced test-task. The two lines represent the server side and the balancer side; life-cycle events on the server side represent the balancer calling out to the server.

Figure 3: Balanced test-execution events on time-line


The Balancer has two major functional components: the Splitter and the Orderer.

Please take a look at Splitter/Balancer Criteria Configuration and Orderer Configuration to understand how Splitters and Orderers can be chosen and configured.

We recommend that users get a good understanding of the roles these two components play. As TLB evolves, the list of alternative algorithms for these functional components is likely to grow. Since TLB configuration allows for a lot of mix-and-match style setup, understanding the role of these components at a broader level will enable users to tweak TLB in ways that work best for them. As an illustration, a sketch of one possible ordering policy follows.
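
Here is a sketch of a simple failed-first reordering, assuming the set of previously failed tests is available from the server. The names are illustrative, and this is only one example of a policy an Orderer might implement.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Sketch of a failed-first ordering policy: tests that failed in the previous
    // run are moved to the front, so feedback on them arrives sooner.
    class FailedFirstOrderSketch {
        static List<String> order(List<String> tests, Set<String> previouslyFailed) {
            List<String> ordered = new ArrayList<>();
            for (String test : tests) {
                if (previouslyFailed.contains(test)) ordered.add(test);  // failed last time: run first
            }
            for (String test : tests) {
                if (!previouslyFailed.contains(test)) ordered.add(test); // the rest follow
            }
            return ordered;
        }
    }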


TLB can work against two different server implementations, which means the Balancer can run against either of them. One comes packaged in the TLB distribution and is called "TLB Server"; the other is a well-known Continuous-Integration/Release-Management server called "Go Server".

TLB server[ http://github.com/test-load-balancer/tlb ]:

The server is bundled in both the tlb-server-gXXX.tar.gz and tlb-complete-gXXX.tar.gz archives, which can be obtained from the download page. The TLB server comes packaged as the tlb-server jar (a java archive that carries all the dependencies the server requires). You can use the "server.[sh|bat]" scripts to manage the server process (these scripts are bundled in the distribution). Alternatively, you can directly invoke: $ java -jar tlb-server-gXXX.jar (substituting the corresponding version/revision of the jar used).

The TLB server binds to port[1] 7019. Once the server is up, partitions that are to be balanced (balancer instances) can be pointed to it by setting the environment variable [2] to the base URL of the TLB server. The Balancer works against an abstraction called Server, and TlbServer[3] is an implementation of this contract that comes packaged in TLB.

Or

Go server[ http://www.thoughtworks-studios.com/go-agile-release-management ]:

TLB has inbuilt support for Go, which means TLB can balance against Go just like it balances against the TLB Server. Running against Go obviously means the tests run as part of a Go task, which runs on a Go agent. Additionally, because TLB is environment-aware, it can infer a few things while running against a Go server. It deduces the equivalents of job-name[4], version[5] and total partitions[6] from the way jobs are configured under a stage and pipeline. To make TLB work with Go, the Server needs to use GoServer[3].

In this case, you do not need to run a separate process (TLB Server) to act as the server, because the Go server plays that role. This does not require any change to the Go server or its configuration, apart from a naming convention your Go job names need to follow: they need to be of the form "<my-job-name>-X" (where X is a natural number 1..n, when you want to make n partitions), or "<my-job-name>-<UUID>" (where each such job will be made to execute only one partition). For instance, to run a job named "acceptance" across three partitions, the Go jobs could be named "acceptance-1", "acceptance-2" and "acceptance-3".

TLB has an abstraction called Talk to service, which is responsible for enabling TLB to talk to the remote Server component. TLB uses this abstraction to download test-results, test-times, etc. from the repository, and to publish run feedback back to the repository (which is used by future builds).
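
In spirit, that contract might look something like the interface below. The name and method signatures are illustrative, not TLB's actual code.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the contract: the balancer only needs a way to pull
    // historical data and to push fresh run feedback.
    interface TalkToServiceSketch {
        List<String> lastRunFailedTests();                       // e.g. for failed-first ordering
        Map<String, Long> lastRunTestTimes();                    // historical test times
        void publishTestTime(String testName, long millis);
        void publishTestResult(String testName, boolean failed);
    }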

TLB will continue introducing such abstractions as it evolves, because this style lends itself to enormous flexibility and configurability, and allows us to provide useful options at every level that users can choose from. In addition, it allows users to write their own implementations of these abstractions, hence allowing easy pluggability and extensibility.



Figure 4: Typical TLB setup for a non-Java/non-JVM project

TLB supports projects in non-Java/non-JVM languages by making the Balancer available as a first-class process, which is an HTTP server. A thin language-specific library takes care of hooking up with the build tool to allow pruning the 'to-be-run' tests list, and also attaches a test-listener to the testing framework. These hooks in turn are just wrappers that make HTTP requests to the local Balancer server (with a plain-text payload; the response has a plain-text payload too). Once the request reaches the balancer, the regular algorithms kick in, and it goes through the same flow as the Java support does. Since all infrastructure except some glue code is reused, implementing support for a new language or framework is just a matter of spawning the balancer process and having the support library talk to it over HTTP.

  1. TLB Server: Already covered in previous sections.
  2. TLB Balancer Server: This is the standalone Java process that exposes its services over HTTP. This server is actually a balancer, which is capable of partitioning a given list of test names and of reporting test-times and test-results back to the TLB Server. This is all that the language-specific glue-code library needs to talk to; the actual TLB server is abstracted away.
  3. Test Runner: This is the runner that actually runs the tests. TLB does not take on the responsibility of running tests; that is still handled by the underlying framework, i.e. the Test Runner. Examples: JUnit, RSpec, Test::Unit, NUnit, etc.
  4. Partitions: These are the parallel machines/VMs/processes/threads that run the same build task. With TLB hooked in, each of these partitions executes a mutually exclusive set of tests. A partition is also responsible for kicking off the build so that the test framework and TLB are started up. Typically, in a CI environment, the machines that execute the test task(s) (and in turn the partitions) are agents from the build grid or build farm of a CI/build server.
  5. Server - Balancer communication: The Balancer posts test-related data about the current test run to the server, and obtains historical data when it is trying to balance and re-order.
  6. Balancing Server - Support Library communication: This is the interaction between the Balancer Server and the Support Library. The support library makes HTTP requests to the Balancer Server: it posts the list of tests and gets back a pruned and reordered list, and it posts the test data (result and time) and gets an acknowledgment (see the sketch after this list).
  7. Support Library: This is the platform/framework/language-specific library that hooks up with the build-tool and test-runner of the subject environment, and is responsible for launching the Balancer Server, talking to it over HTTP, and tearing it down. This is what some parts of this documentation refer to as glue-code. This is the only component that needs to be written for each language that is to be supported.
  8. Test Runner - Support Library communication: Before a test runner starts executing tests, TLB gets a callback with the original list of all the tests. This will be (note: must be) the same across all the partitions. TLB (the Balancing Server) executes the same algorithm on each partition (identified by partition number), and returns the correct pruned and reordered subset of tests to the Test Runner. Hence, the Test Runner ends up executing only a smaller set of tests, taking far less time than executing the whole suite serially.

    Most test runners provide a mechanism for hooking up listeners that are notified of test state as tests execute. The glue-code library uses test-listeners to capture test run times and results, and posts them to the Balancing Server, which in turn pushes them to the actual TLB Server (or equivalent).
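
To give a feel for how thin the glue is, here is a sketch of the support library's round trip with the local Balancer server. The port, path and payload format shown are assumptions for illustration, not TLB's documented wire protocol.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch: post the full test list to the local Balancer server and read back
    // the pruned, reordered subset this partition should run, as plain text.
    class BalancerClientSketch {
        public static void main(String[] args) throws Exception {
            String allTests = String.join("\n", "spec/a_spec.rb", "spec/b_spec.rb", "spec/c_spec.rb");
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8019/balance")) // hypothetical port and path
                    .POST(HttpRequest.BodyPublishers.ofString(allTests))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Each line of the plain-text response names one test this partition should run.
            for (String test : response.body().split("\n")) {
                System.out.println("will run: " + test);
            }
        }
    }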


While this documentation explains the theory in considerable depth, we believe nothing beats rolling your sleeves up and getting your hands dirty with a working example. We highly recommend checking out 'sample_projects', which we believe will prove very useful for both understanding and incorporating test-load-balancing in your project. We encourage users to play with the values in the examples and observe firsthand the effect they have on balancing/reordering.