Getting call performance data in a live Tornado application

In deployed webservices, it's sometimes useful to be able to get data about

  • what calls are being made under the hood
  • with what arguments
  • in what order
  • and how long they take.

This is especially useful if

  • the web service in question has a glue-together kind of role and pulls in data from various other microservices
  • performance degrades only for specific arguments and/or against live data

The former is very often true for Tornado services, and the latter is mostly a simple annoyance -- you can obviously spin up a read-only copy pointing at the live data sources, but that usually involves some effort.

In this case, it's useful to be able to get basic call data for a particular request simply by passing a flag with the request.

There are two parts to implementing this for Tornado: StackContexts to do the actual tracking of function calls, and some Python decorators plus functional programming for the bookkeeping.

We need the bookkeeping part to avoid leaking resources (i.e., never freeing the tracked data) and because the tracked data needs to be accessible after the stack no longer exists.

Have you tried turning it off and on again?

Tornado's stack contexts can be activated and deactivated multiple times; both __enter__ and __exit__ will always be called. If the wrapped function blocks, Tornado deactivates the context and goes on to execute other things waiting for CPU time, then reactivates it when the function resumes. This is also why the StackContext wrapper takes a factory callable, not an instance.

This means that we can get

  • wall clock time spent as the difference of last deactivation and first activation
  • CPU time spent as the sum of the individual deactivation and activation differences
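As a toy illustration of the two computations, assume one context was activated and deactivated three times over the life of a request (the timestamps are made up):

```python
# Hypothetical (activation, deactivation) timestamp pairs for a context
# that was entered and exited three times while its request ran
spans = [(0.0, 1.0), (3.0, 3.5), (5.0, 7.0)]

# Wall clock time: last deactivation minus first activation
wall = spans[-1][1] - spans[0][0]

# "CPU" (active) time: sum of the individual activation/deactivation spans
active = sum(end - start for start, end in spans)

# wall is 7.0 seconds, active is 3.5 seconds
```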

This also means, however, that we need some global state to store data across the context's activations.

To build a nice hierarchical structure for the calls, we also need to track the stack state and differentiate between multiple invocations of the same function.

Assuming for a moment that we have all that, we can implement our stack context as

# Assumes `from time import time`; the per-request profiling store in
# _STATE is a nested attribute-dict (e.g., addict.Dict), so both item
# and attribute access work on it below
@contextlib.contextmanager
def profiling_gathering_context(request_id, invocation_id, func_name, *func_args, **func_kwargs):  
    # Save the previous state to restore upon exiting
    prev_request_id = _STATE.request_id
    # Apply the new state and add this call to the stack
    _STATE.request_id = request_id
    _STATE.stack.append((request_id, invocation_id, func_name))

    # Retrieve the data object for this call. It could be nested, e.g.,
    # 3 levels deep in the stack, so just go through the stack frames
    # and retrieve the data object
    curr_profile = _STATE.profiling[request_id]
    for k in _STATE.stack:
        curr_profile = curr_profile[k]

    # Track the start time and invocation args
    start = time()
    curr_profile.start = min(curr_profile.get('start', start + 1), start)
    curr_profile.args = func_args
    curr_profile.kwargs = func_kwargs
    try:
        yield
    finally:
        # Track the end time
        end = time()
        curr_profile.end = max(curr_profile.get('end', end - 1), end)
        # and restore previous state
        _STATE.request_id = prev_request_id
        _STATE.stack.pop()

The assumption is that request_id is the same across the whole depth of the stack (i.e., for the whole request) and invocation_id is unique for the particular call of a function.

Since Tornado can call this multiple times, we need to restore any global state changes we make upon exiting, so we save the previous request_id.

The stack tracking can be done globally, because Tornado calls all the enclosing wrappers before reactivating the current one: for a call chain a -> b -> c, deactivation runs in the order c -> b -> a, and when c unblocks, activation runs a -> b -> c again.
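This enter/exit ordering is the same as for plain nested context managers, which makes it easy to illustrate without Tornado (a toy stand-in, not the actual StackContext machinery):

```python
import contextlib

events = []

@contextlib.contextmanager
def ctx(name):
    # Record activations and deactivations, like a stack context would
    events.append(('enter', name))
    try:
        yield
    finally:
        events.append(('exit', name))

# A call chain a -> b -> c corresponds to nesting the contexts
with ctx('a'), ctx('b'), ctx('c'):
    pass

# Entries run outermost-first (a, b, c); exits run innermost-first (c, b, a)
```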

Putting up the decorations

The stack context implementation above assumes we have a request_id and invocation_id. It also stores some data against those keys and never deletes that data, which means that as it stands, we're leaking resources. We also need to have an elegant way of automatically wrapping the calls in our stack context.

The first part of the solution is a function decorator

# Assumes `from uuid import uuid4` and
# `from tornado.stack_context import StackContext`
def profiled_func(_impl):  
    @functools.wraps(_impl)
    def _wrapped(*args, **kwargs):
        request_id = kwargs.pop('request_id', _STATE.request_id)

        if request_id:
            invocation_id = str(uuid4())
            with StackContext(functools.partial(
                    profiling_gathering_context, request_id, 
                    invocation_id, _impl.__name__, *args, **kwargs)):
                return _impl(*args, **kwargs)
        else:
            return _impl(*args, **kwargs)

    return _wrapped

This simply generates an invocation_id and wraps the function in a partially applied stack context whenever a request_id is present, either in the kwargs or in the global state. The global one is set by our stack context, so finding it there means we're simply deeper down the stack, not at the initial, root call.

To generate the initial request_id and clean up the global state, we use another Python context manager

class _ProfilingInfo(object):  
    def __init__(self, result, profiling_info):
        self.result = result
        self._profiling_info = profiling_info

    @property
    def profiling_info(self):
        return _format_profiling_info(self._profiling_info)


@contextlib.contextmanager
def profiling_context(func):  
    request_id = str(uuid4())

    try:
        result = func(request_id=request_id)
        profiling_info_ref = _STATE.profiling[request_id]
        yield _ProfilingInfo(result, profiling_info_ref)
    finally:
        # pop() rather than del: if func() raised before any profiling
        # data was recorded, a bare del would mask that exception with a KeyError
        _STATE.profiling.pop(request_id, None)

The profiling_context takes a callable and yields a structure containing the Future object and the profiling data. The context basically just transfers the profiling data reference from the global state into the yielded object, so it gets garbage collected when it goes out of scope together with the RequestHandler instance and we're guaranteed to never leak memory.

We can then use this in a Tornado RequestHandler like this

class Handler(web.RequestHandler):  
    @gen.coroutine
    def get(self, path):
        # Add your own triggering mechanism here...
        if path == 'profile':
            with profiling_context(functools.partial(self.get_impl, 'baz')) as ctx:
                result = yield ctx.result
            result['_debug'] = ctx.profiling_info
        else:
            result = yield self.get_impl('bar')

        self.finish(result)

    @profiled_func
    @gen.coroutine
    def get_impl(self, arg):
        yield [nested_sleep_for_level_1(random()), sleep_for(random())]
        return {'foo': arg}

This is just a simplified example; there are also ways to handle this higher up and make it transparent for all implemented methods, including

  • overriding _execute and wiring this in there
  • wrapping the methods in initialize() via reflection, e.g., for m in self.SUPPORTED_METHODS: getattr(self, m.lower()) ...
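The second option can be sketched without the Tornado specifics; ReflectingHandler and the stand-in profiled_func below are purely illustrative:

```python
def profiled_func(impl):
    # Stand-in for the real decorator above; just tags the result so we
    # can see that the wrapping happened
    def _wrapped(*args, **kwargs):
        return ('profiled', impl(*args, **kwargs))
    return _wrapped

class ReflectingHandler(object):
    SUPPORTED_METHODS = ('GET', 'POST', 'PUT', 'DELETE')

    def initialize(self):
        # Wrap every *implemented* HTTP method via reflection; methods the
        # handler doesn't define are simply skipped
        for m in self.SUPPORTED_METHODS:
            impl = getattr(self, m.lower(), None)
            if impl is not None:
                setattr(self, m.lower(), profiled_func(impl))

    def get(self):
        return 'get-result'

handler = ReflectingHandler()
handler.initialize()
```

In a real handler, profiled_func would be the decorator defined earlier, and the wrapping would happen once per handler instance in initialize().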

The output from all this is hierarchical JSON that's easy to plot using whatever method you need -- the start and end times, durations and arguments are there, so this lends itself well to the kind of graphs you'd see in your browser's dev tools' Network tab.

Full proof-of-concept code and example output available on GitHub.
