Digital risks move quickly. To respond to brokers and policyholders at the various touchpoints of their cyber insurance journey, Coalition uses multiple tools and services. Each day, brokers log into the Broker Dashboard while customer service staff use Admin Tools to issue numerous quotes and policies to an ever-increasing client list. However, this leads to a scaling issue during renewals. If Coalition signs 1,000 new policyholders in Year 1 and 2,000 new policyholders in Year 2, then in Year 2 we will hopefully issue a combined total of 3,000 policies (2,000 new business and 1,000 renewals). Compounded over just a few years, this strains several systems and our support staff. Since the renewal cycle for a policy begins 120 days before the expiration date, many policies are up for renewal at any given time. We need the ability to process renewals reliably and at scale while maintaining visibility into the events attached to a given policy.
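The compounding effect described above can be sketched in a few lines of Go. This is only an illustration: it assumes every existing policy renews (the optimistic case from the text), and the Year 3 figure in `main` is invented for the example.

```go
package main

import "fmt"

// policiesIssued returns, for each year, new business plus renewals,
// assuming every policy written in an earlier year renews.
func policiesIssued(newPerYear []int) []int {
	totals := make([]int, len(newPerYear))
	inForce := 0 // policies from earlier years that come up for renewal
	for i, n := range newPerYear {
		totals[i] = inForce + n
		inForce += n
	}
	return totals
}

func main() {
	// Years 1 and 2 come from the text; Year 3 is a made-up figure.
	for i, t := range policiesIssued([]int{1000, 2000, 4000}) {
		fmt.Printf("year %d: %d policies\n", i+1, t)
	}
}
```

Even with new business merely doubling each year, the renewal share of the workload quickly exceeds what a once-a-day batch was sized for.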
Before the switch to Temporal, which went live in August of 2021, renewals at Coalition were handled via a series of cron jobs. Each job ran daily and, as part of its execution, performed a database check to see which policies needed a specific action taken on them. For example, the first job’s responsibility was to determine the renewal status of a given policy. The status would be either automatic, meaning Coalition did not need any additional information and could automatically issue a renewal policy, or manual, meaning Coalition needed an updated application from the broker and policyholder in order to move forward. Another job was responsible for identifying all policies that were within 90 days of expiration and marked as manual renewal, prefilling an application for each, and sending it via email to the broker. Many other jobs ran to take all of the remaining actions up until the expiration date of the policy.
Due to the nature of the architecture, there was no easy way of tracking which events took place during a given renewal cycle. It was possible to piece together what had happened from the database, but doing so took a lot of work, and because the information lived only in database state, there was no easy way to visualize it for our internal or external users.
The cron jobs had some built-in reliability because they would retry any policy that was still in a given database state within X number of days. Simply retrying the next day was not ideal, however, since in some cases we have legal obligations to meet certain service-level agreements (SLAs). Additionally, this method of retrying broke down when only part of a job succeeded, which left the database state for the renewal in a sort of limbo, and we had no retry logic for dependent service calls.
Although we were able to write unit tests for the individual cron jobs, it was very challenging to set up any integration tests for the entire renewal process. Unfortunately, this meant we were unaware of certain gaps in the process.
The cron jobs were starting to consume a lot of memory as we scaled up to the number of renewals we currently process, since all renewals were handled in a single execution. Given the issues above, we were reaching a point where, even if the code could still scale further, the human support behind it could not.
Temporal offers a lot of visibility into both running and completed workflows via their Web UI, in which you can search by different fields such as workflow ID, run ID, or workflow type. We currently use the Web UI as a developer debugging tool but are looking at expanding its use (see below for more details). In a microservice ecosystem, service-to-service tracing is extremely valuable, and we have been able to set up our renewals services with OpenTracing. We integrated our workflows with Datadog for logging as well as Sentry for error tracking.
The two basic units of work in Temporal are workflows and activities. According to the documentation, an activity is a “single, well-defined action (either short or long-running), such as calling another service, transcoding a media file, or sending an email”. Temporal has built-in retry functionality for when a workflow executes an activity, and we leverage this to get exponential backoff retries for all of our service-to-service calls. This built-in functionality has improved our reliability considerably.
We have architected our renewals workflow using a single long-running workflow with many child workflows that subsequently make calls to different activities. With this, we have built out a robust test suite that can unit test activities and workflows and build integration tests that cover the entire end-to-end renewal process.
Temporal is improving our scalability in terms of the total number of renewals that we can process, since Temporal handles the state persistence and the total computing time for a given renewal is a tiny fraction of the 120-day cycle. On the human side, this has been a complete change for our engineering and customer support teams, as the renewals engine is much smoother, more consistent, and more visible.
From an engineering perspective, the development of Temporal workflows is very modular because it is built around activities. This leads to well-organized code that can be thoroughly tested and quickly developed.
Temporal Web UI offers a lot of visibility into workflows, but this is also exposed via the Temporal Go SDK. One of our upcoming projects is to visually represent a renewal workflow for an existing policy and show all of the actions taken during the renewal process. This visual representation of renewals will benefit our internal users to check the health of a given renewal and see what remedial actions, if any, need to be taken to complete the process. Additionally, this will provide insights and visibility not currently available to customers.
The laws around insurance can vary from state to state, and so far we have followed the principle of having workflows that work for all states. However, we want to create workflows dynamically so that they can be customized for a given state. This would give us more flexibility in how we process renewals and let us provide a better experience to our customers.
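One way to picture state-specific customization is a default list of renewal steps with per-state overrides applied on top. The sketch below is an assumption about how this could look, not a description of Coalition's design: the step names, the 30-day figure, the placeholder state code `XX`, and its override are all invented, and only the 120-day and 90-day marks come from the text above.

```go
package main

import "fmt"

// Step is a hypothetical renewal step and when it should run.
type Step struct {
	Name             string
	DaysBeforeExpiry int
}

// defaultSteps applies to every state unless overridden.
var defaultSteps = []Step{
	{Name: "determine_renewal_status", DaysBeforeExpiry: 120},
	{Name: "send_prefilled_application", DaysBeforeExpiry: 90},
	{Name: "issue_renewal_policy", DaysBeforeExpiry: 30}, // invented value
}

// overrides maps a state code to step name to a replacement timing.
// "XX" is a placeholder, not a real state or a real regulation.
var overrides = map[string]map[string]int{
	"XX": {"send_prefilled_application": 100},
}

// stepsFor builds the step list for a state without mutating the defaults.
func stepsFor(state string) []Step {
	steps := make([]Step, len(defaultSteps))
	copy(steps, defaultSteps)
	for i, s := range steps {
		if days, ok := overrides[state][s.Name]; ok {
			steps[i].DaysBeforeExpiry = days
		}
	}
	return steps
}

func main() {
	fmt.Println(stepsFor("XX"))
	fmt.Println(stepsFor("YY")) // no overrides, so the defaults apply
}
```

Keeping the state rules in data rather than in branching code means a new state-specific requirement becomes a configuration change instead of a new workflow definition.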
The transition to Temporal has had a few bumps in the road (workflow versioning and other non-deterministic behavior, to name a couple), but overall the development of our new Temporal renewals engine has been smooth. There are a lot of quality examples that we have cherry-picked for our code, and there are also tutorial videos for a number of different scenarios. Their support forum has also been valuable to us. Overall, the resources available are solid and have helped us develop code quickly and efficiently.
We are just now beginning to scratch the surface in terms of the capabilities of Temporal and are enthusiastic about exploring new features that help us provide an even better renewal process for Coalition’s customers.
To learn more about engineering at Coalition, visit our careers page for more information and open opportunities.