Premise
I recently had the opportunity to discuss ways to improve the overall supportability and diagnostic capabilities of a third party tool that integrates with Google Cloud services (namely BigQuery). That conversation yielded some interesting discussion and a tiny bit of code, so I thought I’d capture some of it here.
TL;DR: I wrote a simple example showing how a consumer of the cloud.google.com/go/bigquery package can add debug logging of request/response interactions with BigQuery, and I spend some time explaining why this can be valuable. See the details towards the end if you’re just interested in the code.
Background
Consider this (vastly simplified) architecture pattern:
In this scenario, we have a user who engages with GCP services indirectly via a third party tool. That tool has its own application code and supplementary libraries, either built or sourced by the tool maker. However, to communicate with the GCP service (in this case BigQuery), it uses one of the libraries provided to GCP users: the cloud.google.com/go/bigquery package. That package in turn has a direct dependency on another package, google.golang.org/api/bigquery/v2, which is a lower-level library that manages aspects like JSON serialization/deserialization when interacting with BigQuery.
In this architecture, all communication with BigQuery is mediated through the provided libraries.
Is there a doctor in the house?
Let’s examine this architecture through the lens of identifying and resolving failures on behalf of the user. In a perfect world, our hypothetical user suffers from no engineering defects, and their experience is one of bliss. That hypothetical user is boring, so let’s talk about a realistic user. Our realistic user has real problems, born of many possible factors: unusual use cases, transient problems, changing requirements.
Through that lens, there are three groups poised to provide the most insight into, and ability to address, the issues impacting our user.
There are the tool maintainers, who built and provide the actual tool to the user. These toolsmiths control how the tool works, and thus understand how user actions in the tool translate into intended interactions with the remote services. They may not always have a full understanding, however, particularly if the interaction surface is rich and/or the user brings a lot of context to the interaction. A classic example: if a user supplies SQL text directly to the tool, you can’t always infer their intent from the supplied statements.
There are the library maintainers, who ensure that the programming abstraction supplied to the tool maintainers works as expected, is sufficiently expressive, and provides quality-of-life improvements for the tool builders. Looking in the other direction, the library helps ensure that requests to the backend service are well formed and that the responses issued by the service can be properly understood and consumed.
Finally, there are the service maintainers, who run the service itself and are focused on correctness, performance, and all the other concerns that go along with building and running a service. The service maintainers may or may not have any special awareness of specific interactions, users, tools, or workloads.
Describe your symptoms
I often find the parable of the blind men and the elephant useful when describing group interactions, and it’s relevant here. While we have three different groups capable of helping in our diagnostic scenario, when they need to interact with one another there’s confusion, because each is using a different abstraction to describe the problem. The tool maintainers may describe the issue in terms of a specific function call they find problematic; the backend engineers may describe it in terms of a specific microservice that is returning a malformed response, for example.
So, how do we improve their ability to communicate discoveries with one another? How do we find a common abstraction to communicate through? Here, the answer is less about abstraction and more about sharing concrete details that all parties can interpret. So, let’s talk concretely about what all these parties are working with, namely the BigQuery API. The primary BigQuery API is a REST-based service that uses JSON for requests and responses. If you’re interested in the details, the REST documentation describes how low-level interactions with the service work. There are other APIs within BigQuery that use other protocols and technologies (e.g. gRPC), but we’re constraining our focus to this one API for now.
The middle ground we want is something that’s easily digestible by each group, yet still valuable to each in its own right. In this scenario, that’s a record of how the interaction actually manifested between the client and the backend, in the form of captured HTTP request and response bodies. Each group gets value from this:
- The maintainers of the tool can see in detail how the interactions manifest with the service, and can compare that with the service documentation.
- Similarly, the maintainers of the libraries can understand the interaction patterns between the client and the backend, and identify problems with the abstractions.
- And finally, the maintainers of the service can understand exactly what’s being sent to them, and how the service is responding.
Building Our Diagnostics
In order to enable our desired solution, we need to wire up the instrumentation. The place we do that is in the tool itself. Fortunately, it’s not all that painful. I wrote a very simple example logging package as a proof of concept: https://github.com/shollyman/apilogging. It’s greatly simplified by relying on other supporting packages.
I previously mentioned the cloud.google.com/go/bigquery package and its reliance on the lower-level google.golang.org/api/bigquery/v2 package. One property of the library is that it uses a Client abstraction, and many (if not most) of the options you can pass to the bigquery library propagate into its usage of that dependency. In particular, we can pass a custom HTTP client into the constructor as a variadic option: option.WithHTTPClient.
However, the HTTP transport does a lot of heavy lifting when communicating with the API, such as adding authentication headers, and we don’t want to foist rewriting all of that onto anyone. Fortunately, much as in Go’s server-side HTTP middleware patterns, we can simply chain in an interceptor by implementing the RoundTripper interface from the net/http package.
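To make the idea concrete, here’s a minimal sketch of what such an interceptor could look like. This is not the implementation from the apilogging package; loggingRoundTripper and its fields are hypothetical names, and the httputil dump helpers are just one convenient way to capture the wire-level details.
import (
	"log"
	"net/http"
	"net/http/httputil"
)

// loggingRoundTripper logs the outgoing request and the incoming response,
// delegating the actual transport work to a wrapped RoundTripper.
type loggingRoundTripper struct {
	wrapped http.RoundTripper
	logger  *log.Logger
}

func (t *loggingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// Capture the request roughly as it will appear on the wire, body included.
	if dump, err := httputil.DumpRequestOut(req, true); err == nil {
		t.logger.Printf("request:\n%s", dump)
	}
	resp, err := t.wrapped.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	// Capture the response, body included (DumpResponse preserves the body
	// for downstream consumers).
	if dump, err := httputil.DumpResponse(resp, true); err == nil {
		t.logger.Printf("response:\n%s", dump)
	}
	return resp, nil
}
Because the interceptor only observes and delegates, it doesn’t need to know anything about authentication, retries, or the shape of the API.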
Within google.golang.org/api, the google.golang.org/api/transport/http package provides much of what we just mentioned, such as injecting authentication credentials into requests. In part, it works by (re)using the same RoundTripper interface to add these details to a request before it’s sent to the actual API service. So, our task is to chain things properly: log what gets sent out after that decoration, and log what comes back before any transformation happens. Fortunately, the transport package exposes a NewTransport() function that accepts a base RoundTripper, which is exactly the hook we need to build our interceptor chain.
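Putting those pieces together, the wiring might look roughly like the following. This is a sketch rather than the apilogging package’s actual constructor: newInstrumentedClient is a hypothetical helper, it reuses the hypothetical loggingRoundTripper sketched above, and the option.WithScopes(bigquery.Scope) credential option is an assumption about a reasonable default.
import (
	"context"
	"log"
	"net/http"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/option"
	htransport "google.golang.org/api/transport/http"
)

func newInstrumentedClient(ctx context.Context, logger *log.Logger) (*http.Client, error) {
	// Our interceptor sits at the bottom of the chain, so it sees the request
	// after the transport package has decorated it (auth headers, etc.) and
	// sees the response before any higher-level processing.
	base := &loggingRoundTripper{
		wrapped: http.DefaultTransport,
		logger:  logger,
	}
	// NewTransport layers credential injection and related behavior on top of
	// the base RoundTripper we supply.
	rt, err := htransport.NewTransport(ctx, base, option.WithScopes(bigquery.Scope))
	if err != nil {
		return nil, err
	}
	return &http.Client{Transport: rt}, nil
}
The resulting *http.Client is what ultimately gets handed to bigquery.NewClient via option.WithHTTPClient, as shown in the next section.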
Ermagerd, just show some code already
So, what’s it look like to use the example logger? First, let’s look at the typical instantiation of a BigQuery client:
ctx := context.Background()
client, err := bigquery.NewClient(ctx, projectID)
Now, to use the example logger, most of our effort goes into configuring where the diagnostics get written. Here’s an example that writes them to its own logfile:
ctx := context.Background()

// Set up a file logger.
fname := "integration_log_bigquery.txt"
f, err := os.Create(fname)
if err != nil {
	log.Fatalf("os.Create: %v", err)
}
defer f.Close()
l := log.New(f, "", log.LstdFlags)

// Now, use the logger to construct an instrumented HTTP client.
hc, err := NewLoggingHTTPClient(ctx, &LoggerConfig{
	Logger:              l,
	CaptureFullRequest:  true,
	CaptureFullResponse: true,
})
if err != nil {
	log.Fatalf("NewLoggingHTTPClient: %v", err)
}

// Finally, use the instrumented client to construct a BigQuery client.
client, err := bigquery.NewClient(ctx, projectID, option.WithHTTPClient(hc))
Example Output
Here’s an example request sent while enumerating the datasets in the BigQuery public dataset project. Within the tool, the code invoked client.DatasetsInProject(ctx, projectID) and traversed the corresponding iterator within the BigQuery library.
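For illustration, that traversal might look roughly like the sketch below (the error handling and output are placeholders, iterator refers to the google.golang.org/api/iterator package, and the project ID matches the public dataset project seen in the captured request):
it := client.DatasetsInProject(ctx, "bigquery-public-data")
for {
	ds, err := it.Next()
	if err == iterator.Done {
		// All datasets have been enumerated.
		break
	}
	if err != nil {
		log.Fatalf("iterating datasets: %v", err)
	}
	fmt.Println(ds.DatasetID)
}
One of the page fetches driven by that loop produced the request below (note the pageToken query parameter):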
GET /bigquery/v2/projects/bigquery-public-data/datasets?all=false&alt=json&pageToken=utility_eu&prettyPrint=false HTTP/1.1
Host: bigquery.googleapis.com
Authorization: Bearer <REDACTED>
User-Agent: google-api-go-client/0.5
X-Cloud-Trace-Context: 8539ac589047afe5ddd8137966d27c34/1140646270334262235;o=0
X-Goog-Api-Client: gl-go/1.15.2 gccl/20201002
From this, we see that we have authentication information (the redacted Bearer token in the Authorization header), as well as other details such as the client library versions in use and additional trace instrumentation (a topic for another time).
And here we see the corresponding response:
HTTP/2.0 200 OK
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Cache-Control: private
Content-Type: application/json; charset=UTF-8
Date: Thu, 29 Oct 2020 16:32:23 GMT
Etag: 6Yqip2DpH6xw7EmscsBfWg==
Server: ESF
Vary: Origin
Vary: X-Origin
Vary: Referer
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
{"kind":"bigquery#datasetList","etag":"6Yqip2DpH6xw7EmscsBfWg==","datasets":[{"kind":"bigquery#dataset","id":"bigquery-public-data:utility_us","datasetReference":{"datasetId":"utility_us","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:wikipedia","datasetReference":{"datasetId":"wikipedia","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:wise_all_sky_data_release","datasetReference":{"datasetId":"wise_all_sky_data_release","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:world_bank_global_population","datasetReference":{"datasetId":"world_bank_global_population","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:world_bank_health_population","datasetReference":{"datasetId":"world_bank_health_population","projectId":"bigquery-public-data"},"labels":{"freebqcovid":"freebqcovid"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:world_bank_intl_debt","datasetReference":{"datasetId":"world_bank_intl_debt","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:world_bank_intl_education","datasetReference":{"datasetId":"world_bank_intl_education","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:world_bank_wdi","datasetReference":{"datasetId":"world_bank_wdi","projectId":"bigquery-public-data"},"location":"US"},{"kind":"bigquery#dataset","id":"bigquery-public-data:worldpop","datasetReference":{"datasetId":"worldpop","projectId":"bigquery-public-data"},"labels":{"freebqcovid":"freebqcovid"},"location":"US"}]}
And just like that, we have instrumented calls with debugging details. A production implementation would likely hook into a more robust logging facility.
Finally, astute observers may note a mismatch in the example request/response: the request is logged as HTTP/1.1, but the response reports HTTP/2.0. This is because the instrumented RoundTripper captures the request before passing it to the default HTTP transport, which Go upgrades to HTTP/2 when the server supports it.