Leveraging the Actor Model in Financial Transaction Systems | Nikita Melnikov | Conf42 PE 2024

Financial systems face complex challenges, but using the actor model can streamline processes and solve concurrency issues effectively.

My name is Nikita Melnikov, and I am the VP of Engineering at Atlantic Money. Today, we will talk about effortless concurrency, financial systems, their challenges, and how to use the actor model to solve everything we need in these kinds of systems.

About me: As I said before, I am the VP of Engineering at Atlantic Money. I also have a background in banking and investments, with more than 10 years of experience in fintech. I have worked on high-throughput systems that handled more than 300,000 requests per second. Additionally, I love working with Scala, Go, PostgreSQL, and Kafka.

Let's begin by taking a look at the agenda. We will cover a few topics, moving from the problem to the solution that I propose. The problem is about financial transactions and the issues we might encounter. After that, we will shift to the traditional approach and its limitations. We will, of course, speak about asynchronous processing, Kafka, how to implement asynchronous processing using Kafka, why we need it, and how it could help us. The main topic is the actor model.

What is a financial transaction? We will use our company as an example and review the standard process for our cases. A customer can create a transfer in our app. For example, if you want to send $100 to someone, you create the transfer in our app, send a $100 payment to us, and we send euros to your recipient. Let's simplify it a bit: we wait for dollars, we run some checks, and we send euros.

Sounds simple, right? But in fact, what happens in the real world is more complex.

First of all, as we discussed before, the customer creates a USD to EUR transfer. The system waits for USD and requires the payment details: we need to know the payment amount, sender, recipient, bank information, and any other necessary information. After that, the system runs several checks: sanction lists, anti-fraud systems, payment limits, fee calculations, and many more. Of course, we have to exchange currencies to send another currency to your recipient. Finally, we can send EUR to the recipient.

Where is the problem? There can be many problems in terms of compliance, operations, finance, and so on, but we will review only one technical problem today: the problem of lost updates.

Let's take a look at the diagram. When the transfer system receives your payment, we should find the proper transfer in the database. After finding it, we should check the sender and recipient, for example, in our compliance service. If everything is fine, we can go further, processing the request, doing some calculations, applying fees, exchanging currencies, and finally updating the transfer to set the status that everything is fine and the money has been paid out to your recipient.

Where is the problem? The problem arises when the compliance system decides to cancel the transfer while we are processing the original payment. The process looks mostly the same, but after running its checks, the compliance service can decide to cancel the transfer. It hits the transfer cancel endpoint, and we have to select the transfer for restriction and update the transfer as well.


=> 00:06:28

Concurrent transactions can lead to undefined statuses; use "select for update" to ensure data consistency.

Let's take a look at two transactions. On the left, we have the compliance transaction, and on the right, the payment arriving into our system. We start both transactions by selecting the transfer and getting the result. The result is the same: ID one, status created. Everything looks fine in both transactions. After that, we set different statuses: for compliance, we set the status to cancelled; for our payment, we set the status to payment received. We then commit the changes.

The problem arises if we select the transfer right after these changes. We can get an undefined status because we have two concurrent transactions, and we cannot guarantee the order in which they commit. So the final status can be cancelled, payment received, or anything else; it is effectively undefined.
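As a minimal sketch of this race, assuming a Postgres transfers table with id and status columns, the standard database/sql package, and the lib/pq driver, the following Go program lets the compliance update and the payment update race each other:

```go
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq" // Postgres driver
)

// setStatus reads the transfer, "processes" it, and writes a new status.
// The plain SELECT takes no lock, so two concurrent calls both see "created".
func setStatus(db *sql.DB, id int, status string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var current string
	if err := tx.QueryRow(`SELECT status FROM transfers WHERE id = $1`, id).Scan(&current); err != nil {
		return err
	}
	// ... checks, fees, and currency exchange would happen here ...
	if _, err := tx.Exec(`UPDATE transfers SET status = $1 WHERE id = $2`, status, id); err != nil {
		return err
	}
	return tx.Commit() // last committer wins: the other update is lost
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/transfers?sslmode=disable")
	if err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for _, s := range []string{"cancelled", "payment_received"} {
		wg.Add(1)
		go func(status string) { // compliance and payment race each other
			defer wg.Done()
			_ = setStatus(db, 1, status)
		}(s)
	}
	wg.Wait()

	var final string
	_ = db.QueryRow(`SELECT status FROM transfers WHERE id = 1`).Scan(&final)
	fmt.Println("final status:", final) // cancelled or payment_received: undefined
}
```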

To solve it, let's talk about traditional approaches. I don't have the time to cover all possible options, and we are limited by the technology we use, so let's imagine we have a Postgres database. We can solve it this way: the database transaction looks the same as before, except that we start the transaction, select the transfer FOR UPDATE, and then commit the changes. Everything is okay.
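A sketch of the same transaction with the one-line fix, reusing the setup from the previous sketch; only the FOR UPDATE clause is new:

```go
// setStatusLocked is the same transaction, with one change: FOR UPDATE.
// A concurrent transaction blocks on this SELECT until we commit.
func setStatusLocked(db *sql.DB, id int, status string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var current string
	err = tx.QueryRow(
		`SELECT status FROM transfers WHERE id = $1 FOR UPDATE`, id,
	).Scan(&current)
	if err != nil {
		return err
	}
	// ... processing runs while the row lock is held ...
	if _, err := tx.Exec(`UPDATE transfers SET status = $1 WHERE id = $2`, status, id); err != nil {
		return err
	}
	return tx.Commit() // committing releases the row lock
}
```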

Let's implement it on the diagram. This is the same diagram as before, except for one small thing: instead of just selecting the transfer, we select it for update. We start the payment receiving process by selecting for update and checking the sender and recipient. Everything is fine at that moment. Then the compliance system decides to cancel and restrict the transfer. It issues a select for update as well, but it has to wait, because there is already an open transaction holding the update lock on that row (or rows). Postgres blocks it until the first process is completed.

The first process, receiving the payment, is still in progress. We selected the transfer, did some calculations, worked on the processing, and finally did the update at step six. Once we commit the changes, the second process can continue. The compliance service had already sent its select for update request to our database and was waiting for the result. Now the result is finally here, and it can continue. Everything is solved with only two database transactions, no new abstractions or tools.

However, there are limitations. Let's make a quick calculation. Imagine we have an average processing time of 5 seconds and 100 operations per second. That means we will have around 500 active transactions at any moment. This doesn't seem complicated; Postgres could handle it. But there are two drawbacks. First, resources: we have many active transactions that do nothing, because we are waiting, calling external services, writing some audit data, and fetching data from other services. While we do this, the transaction stays active, and nobody else can work with the locked row, for example, this transfer.


=> 00:11:38

Connection pools can bottleneck your app if too many concurrent transactions are fighting for limited database access.

The second drawback concerns connection pools. Most applications use a connection pooling system, often limited to, say, 16 connections to a PostgreSQL instance. This setup generally works well while requests touch different transfers. However, issues arise with many concurrent transactions in the database. Imagine 16 concurrent processes: the first connection is acquired by an active transaction, and while its processing takes about 5 seconds, other operations must wait. This contention on the lock and the database can lead to an empty connection pool, where every connection is busy, preventing us from serving customer requests or other transfers.

To address this, we can consider implementing locks. Locks can be implemented in different systems in different ways; today, we will discuss local locks and distributed locks. Starting with local locks in Go, this can be done via a mutex. When running a transfer, we acquire the mutex and unlock it afterward. If the mutex is already locked, the goroutine waits until the lock is released.
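A minimal sketch of such per-transfer local locking in Go, assuming string transfer IDs; the locks map and helper names are illustrative:

```go
package main

import "sync"

// A registry of per-transfer mutexes: commands for the same transfer
// serialize, while different transfers still run in parallel.
var (
	mu    sync.Mutex                 // guards the locks map itself
	locks = map[string]*sync.Mutex{} // transfer ID -> its mutex
)

func lockFor(transferID string) *sync.Mutex {
	mu.Lock()
	defer mu.Unlock()
	l, ok := locks[transferID]
	if !ok {
		l = &sync.Mutex{}
		locks[transferID] = l
	}
	return l
}

func runTransfer(transferID string, process func()) {
	l := lockFor(transferID)
	l.Lock() // a second goroutine for the same transfer waits here
	defer l.Unlock()
	process()
}
```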

This approach is straightforward but problematic when dealing with multiple nodes or endpoints. If one node holds a local lock for a transfer and another node receives a request for the same transfer, the locks become ineffective.

To overcome this, we can use distributed locks, which introduce additional infrastructure and complexity. In a distributed lock system, instead of a single mutex in a single-node application, we use a distributed lock manager. Node one can request a lock from the lock manager. If the lock is not already acquired by another process, it is granted. If another node requests the same lock, it will wait. This ensures that if node one has access, it can safely access resources, and once done, it releases the lock, allowing the lock manager to grant it to the next node.

Regarding storage options for distributed locks, there are many choices, including Hazelcast, Zookeeper, etcd, Consul, and Redis. Each has its advantages and can be selected based on familiarity and requirements.
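As a rough sketch of one popular option, here is lock acquisition on Redis via the go-redis client's SetNX; the key naming and holder token are assumptions, not a production-grade lock:

```go
package main

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var ErrLockHeld = errors.New("lock already held by another process")

// acquire takes the lock via SET NX: the key is created only if it does not
// exist yet. The TTL guards against a holder that crashes without releasing.
func acquire(ctx context.Context, rdb *redis.Client, transferID string, ttl time.Duration) (release func(), err error) {
	key := "lock:transfer:" + transferID
	ok, err := rdb.SetNX(ctx, key, "holder-token", ttl).Result()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, ErrLockHeld
	}
	// Note: a production lock must verify the holder token before deleting
	// (typically with a Lua script), or it may release someone else's lock.
	return func() { rdb.Del(context.Background(), key) }, nil
}
```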

=> 00:16:41

Managing distributed systems is all about mastering timeouts and avoiding deadlocks.

Use whatever you know and whatever you like; any of them can work. What about limitations here? The limitations are actually quite complex, because under load it can be challenging to understand how the system will behave due to the problem of ordering. Imagine you have multiple nodes, for example, a cluster with six nodes. They can be placed in different zones, different data centers, and different networks. Due to many factors, they simply have different latencies to your lock manager, for example, Hazelcast or Zookeeper. This is the first problem: we cannot explain why the lock was granted to a node that requested it later than another one. It is quite difficult to reason about, simply because of the ordering problem. Then there are timeouts.

There are actually multiple timeouts. First of all, there is the lock acquisition timeout. As engineers, we should understand that when trying to get a lock from our lock manager, we must set some timeout, because we cannot wait indefinitely for the lock; that would just hang the process. We need to know how to set this timeout and how to guarantee it. The second point is how to manage the timeout once we have already acquired the lock. Imagine we spent 3 seconds waiting for a lock; now we should also limit the time we spend running the actual business logic. So we already have two points here: we must respect both timeouts and understand how to work with them.
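A sketch of both timeouts using Go contexts, building on the hypothetical acquire helper and ErrLockHeld from the previous sketch; runBusinessLogic is an assumed placeholder for the actual processing:

```go
// acquireWithRetry keeps retrying until the lock is free or the context's
// deadline (the lock acquisition timeout) expires.
func acquireWithRetry(ctx context.Context, rdb *redis.Client, transferID string, ttl time.Duration) (func(), error) {
	for {
		release, err := acquire(ctx, rdb, transferID, ttl)
		if err == nil {
			return release, nil
		}
		if !errors.Is(err, ErrLockHeld) {
			return nil, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // timeout 1: gave up waiting for the lock
		case <-time.After(100 * time.Millisecond):
			// lock is busy; retry shortly
		}
	}
}

func process(rdb *redis.Client, transferID string) error {
	// Timeout 1: never wait for the lock indefinitely.
	acqCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	release, err := acquireWithRetry(acqCtx, rdb, transferID, 10*time.Second)
	if err != nil {
		return err // nothing was changed, so it is safe to give up
	}
	defer release()

	// Timeout 2: bound the work so it finishes before the lock TTL expires.
	workCtx, cancelWork := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelWork()
	return runBusinessLogic(workCtx, transferID) // assumed business step
}
```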

What should we do if we didn't get the lock? Probably everything is fine; we simply didn't do anything. But what if we got the lock and then lost it, or the process crashed while holding it? This is challenging because you have to account for it in your code: every process must assume it can be killed and the server can fail, so we need to know how to restore the process and deal with the partial state produced while the lock was held. And of course, there are potential deadlocks. For example, to modify the state we may need to hold locks for the transfer, the check, and the customer, touching every object involved. It can be hard to guarantee there is no deadlock. For instance, the first process takes a lock on the transfer and then tries to take a lock on the recipient, while a second process started with the recipient and then tries to take a lock on the transfer. Each now holds the lock the other one needs. It is quite challenging, and I don't know how to guarantee that there are no deadlocks at all, except through testing. However, testing distributed systems is super complex, and I don't have that much time to test distributed systems on my own.
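A tiny sketch of that interleaving with plain mutexes; if processA and processB run concurrently and each takes its first lock before the other's second, both block forever:

```go
package main

import "sync"

var (
	transferLock  sync.Mutex
	recipientLock sync.Mutex
)

// processA locks transfer -> recipient.
func processA() {
	transferLock.Lock()
	defer transferLock.Unlock()
	recipientLock.Lock() // waits for B, which is waiting for transferLock
	defer recipientLock.Unlock()
}

// processB locks recipient -> transfer: the opposite order.
func processB() {
	recipientLock.Lock()
	defer recipientLock.Unlock()
	transferLock.Lock() // waits for A: neither can ever proceed
	defer transferLock.Unlock()
}
```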

Let's try to switch to asynchronous processing. First, let's define what the transfer model is. It is actually a finite state machine, because a transfer has multiple statuses and state transitions. Each state defines allowed commands, trigger actions, and state changes. That's all. So a finite state machine (FSM) just switches between states in response to events or commands. For example, we cannot go from payment received back to payment waiting, because that is impossible in our domain.

Let's take a look at the code. First of all, let's define the transfer. For simplicity, it has only an ID and a status. What is the FSM here? A state machine: we have a transfer in status created, we receive a ReceivePayment command, and we can move the transfer to payment received. Let's define the statuses: created, payment received, checks passed, exchanging, and so on, the whole tree of statuses and the transitions between them. Of course, we also need to define a command. A command is something that can happen in your application: it can arrive via HTTP endpoints, gRPC endpoints, broker messages, timers, or whatever. In fact, we just need something that tells the transfer it should move from one step to the next.
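A minimal sketch of these definitions in Go; the exact status names beyond created and payment received are assumptions reconstructed from the transfer flow described earlier:

```go
package main

// Status is one node in the transfer's state machine.
type Status string

const (
	StatusCreated         Status = "created"
	StatusPaymentReceived Status = "payment_received"
	StatusChecksPassed    Status = "checks_passed" // assumed name
	StatusExchanged       Status = "exchanged"     // assumed name
	StatusPaidOut         Status = "paid_out"      // assumed name
	StatusCancelled       Status = "cancelled"
)

// Transfer is deliberately minimal: an ID and a status.
type Transfer struct {
	ID     string
	Status Status
}

// Command is anything that asks the transfer to move one step further:
// an HTTP call, a gRPC call, a broker message, a timer.
type Command interface{ isCommand() }

type ReceivePayment struct {
	Amount  int64
	Details string
}
type RunChecks struct{}
type Cancel struct{}

func (ReceivePayment) isCommand() {}
func (RunChecks) isCommand()      {}
func (Cancel) isCommand()         {}
```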

Let's take a look at how we could handle this command in code. Of course, we won't review everything here; we will start with something super simple. For the ReceivePayment command, first of all, we should check the status. This is simplified: we ignore error handling, edge cases, and so on, just the pure logic. If we've got a ReceivePayment command, we know the status must be created; if not, we should return an error. Then we write the payment details, set the new status, save the state, and tell the transfer that it needs to run some checks to move further.
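A sketch of that handler, building on the types above; writePaymentDetails, save, and enqueue are assumed helpers for illustration, not the actual production code:

```go
// Handle applies a command to the transfer, following the steps above:
// check the current status, record the payment, persist, then ask for checks.
func (t *Transfer) Handle(cmd Command) error {
	switch c := cmd.(type) {
	case ReceivePayment:
		// A payment is only acceptable while the transfer is freshly created.
		if t.Status != StatusCreated {
			return fmt.Errorf("unexpected status %q for ReceivePayment", t.Status)
		}
		writePaymentDetails(t.ID, c.Details) // assumed helper
		t.Status = StatusPaymentReceived
		if err := save(t); err != nil { // assumed persistence helper
			return err
		}
		return enqueue(t.ID, RunChecks{}) // assumed helper: schedule the checks
	default:
		return fmt.Errorf("unhandled command %T", cmd)
	}
}
```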

Requirements for asynchronous processing: Now we know the model and understand that the transfer is really an FSM. We need commands, messages, and states. Let's set some requirements for our system. First of all, we need communication through messages. Messages are commands or events: something one service can tell another service so that it runs something. We also need one-at-a-time message handling: if we got a command for transfer A, only one command can be executing for this transfer at a time; if a second one arrives, it should be queued somehow. We also need a durable message store, meaning we should keep our messages for some time, because of node failures, location failures, data center failures, and many other reasons. And we need to be able to replay messages, maybe for analytics, or if we deployed something broken.
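As a sketch of how Kafka can meet these requirements, assuming the segmentio/kafka-go client and a hypothetical transfer-commands topic: keying each message by transfer ID routes all commands for one transfer to the same partition, which gives ordered, one-at-a-time handling per transfer, while Kafka's retention provides the durable, replayable store.

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// sendCommand publishes a command keyed by transfer ID. Messages with the
// same key land on the same partition, so they are consumed in order.
func sendCommand(ctx context.Context, w *kafka.Writer, transferID string, cmd any) error {
	payload, err := json.Marshal(cmd)
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(transferID), // same transfer -> same partition -> ordered
		Value: payload,
	})
}

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "transfer-commands", // assumed topic name
		Balancer: &kafka.Hash{},       // partition by message key
	}
	defer w.Close()
	_ = sendCommand(context.Background(), w, "transfer-1",
		map[string]string{"type": "receive_payment"})
}
```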