High-Load Systems Challenges in Social Network Development | Alexandr Kolobov | Conf42 PE 2024
Table of contents
- High-load systems demand custom solutions when traditional methods can't handle massive data and user volumes.
- Scaling your platform means constant tweaking, custom solutions, and zero downtime tolerance—it's a high-stakes game!
- Transforming PHP code into C++ can make it 7-10 times faster without rewriting the entire codebase.
High-load systems demand custom solutions when traditional methods can't handle massive data and user volumes.
I am Alexandr Kolobov, and today I'm going to talk about high-load systems and the challenges they bring. My presentation will be based on my extensive experience in social network development and product management. If you're interested, fasten your seat belt and join the ride.
First, let me introduce myself and shed some light on my experience. I worked as a team lead at one of the biggest social networks, where I led teams of up to 10 members, including specialists, analysts, and project managers. As a developer, I designed, developed, and maintained various features for the desktop and mobile web versions of the social network across the backend, the frontend, and the mobile application APIs. I redesigned the social network interface for multiple user sections, completely reworked the network's widgets for external sites, maintained privacy settings for closed profiles and content, overhauled the backend and frontend of the mail notification system handling millions of emails daily, and created a system for conducting NPS and CSI surveys that covered the two largest Russian social networks.
Today, I want to touch upon the following aspects of the topic. We'll briefly discuss how to tell whether a system is high-load or not. Then we'll talk about the way high loads change system requirements, and, based on my experience, I'll highlight which approaches and technologies can help overcome high-load challenges.
Let's begin with the definition. What system can we call high-load? A system is considered high-load if it meets several criteria. First of all, high request volume: the system must handle millions of requests daily. Second, a large user base: it must support millions of concurrent users. Next, extensive data management: it has to manage terabytes or even petabytes of data. Performance and scalability: it must maintain responsiveness under increasing loads. Complex operations: it performs resource-intensive calculations or data processing. High reliability: it requires 99.9% or higher uptime. Geographical distribution: it serves users across multiple locations with low latency. Concurrent processing: it handles numerous concurrent operations. And finally, load balancing: it distributes traffic effectively to avoid bottlenecks.
If I were to give a one-sentence definition of what a high-load system is, I would say that high-load is when usual methods for processing requests, storing data, and managing infrastructure are no longer enough, and there is a need to create custom solutions.
Let's take a look at the VK social network's loads. Here is what the system had to process a couple of years ago: 100 million users per month, 100 million posts and other pieces of content created per day, and 9 billion post views per day. The numbers you saw on the previous slide translate into these performance metrics, so we can definitely call VK's loads high.
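To get a feel for what volumes like these mean per second, here is a quick back-of-the-envelope calculation. It is my own illustration, not a figure from the talk, and the peak-to-average ratio is an assumed value purely for the example:

```php
<?php
// Back-of-the-envelope conversion of daily volumes into request rates.
// The x2.5 peak factor is an assumption for illustration, not a VK figure.

declare(strict_types=1);

$viewsPerDay   = 9_000_000_000;   // 9 billion post views per day
$secondsPerDay = 24 * 60 * 60;    // 86,400 seconds

$averageRps = $viewsPerDay / $secondsPerDay;
$peakRps    = $averageRps * 2.5;  // assumed peak-to-average ratio

printf("average: ~%d views/sec, assumed peak: ~%d views/sec\n", $averageRps, $peakRps);
// average: ~104166 views/sec, assumed peak: ~260416 views/sec
```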
Now that we have determined that our system is high-load, we can take a step further and look at the difficulties the management of such systems entails. First of all, there is performance: maintaining fast response times and processing under high-load conditions. Besides that, we need to store, retrieve, and process large volumes of data effectively. Then we have to make sure the system can scale at any stage and remains operational and available despite high traffic and potential failures. And finally, we need to build systems that can recover from failures and continue to operate smoothly.
Apart from challenges, high-load systems bring certain risks, and that is why we have to question some of the traditional tools.
Scaling your platform means constant tweaking, custom solutions, and zero downtime tolerance—it's a high-stakes game!
The main issue with external solutions is that they are not highly optimized; they are designed for broad market applicability, which often comes at the expense of performance. There is also an issue with security: on one hand, external solutions are usually well-tested thanks to their wide user base, but on the other hand, fixing identified issues quickly and precisely is challenging, and updating to a fixed version might lead to compatibility problems. External solutions also require ongoing tweaking and fixing, which is very difficult unless you are a committer on that solution. Finally, they may not scale effectively with growing loads.
Reliability, data management, and scaling requirements keep increasing. In the past, downtime for maintenance was acceptable, as users had lower expectations and fewer alternatives. Today, with the vast availability of online services and high competition among them, even short periods of downtime can lead to significant user dissatisfaction and negatively affect the net promoter score. A service that recovered quickly used to be treated as if it had never been down; today the industry often references the five nines standard, 99.999% uptime, which allows only about five minutes of downtime per year. Users also used to keep their own backups, but now cloud services must ensure zero data loss. And while systems were once planned well in advance, they now need to scale linearly at any moment due to possible explosive audience growth.
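To put the five nines figure into perspective, here is a small calculation of my own (not part of the talk) that converts an uptime target into a yearly downtime budget:

```php
<?php
// Illustrative only: turning an uptime target into a downtime budget.
// 99.999% uptime leaves roughly five minutes of downtime per year.

declare(strict_types=1);

function downtimeMinutesPerYear(float $uptime): float
{
    $minutesPerYear = 365.25 * 24 * 60; // ~525,960 minutes in a year
    return $minutesPerYear * (1.0 - $uptime);
}

foreach ([0.999, 0.9999, 0.99999] as $uptime) {
    printf("%.3f%% uptime -> ~%.1f minutes of downtime per year\n",
        $uptime * 100, downtimeMinutesPerYear($uptime));
}
// 99.900% -> ~526 minutes (almost 9 hours)
// 99.990% -> ~53 minutes
// 99.999% -> ~5 minutes
```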
Convenience and time to market also matter: in a competitive environment, it is essential to launch features quickly and frequently. Further on, we'll discuss some possible ways to overcome these challenges and meet high-load requirements. Here you can see how the VK social network grew, gradually transformed its architecture, and adapted or created technologies that suited its scale and new requirements. As the platform grew and attracted a large audience, numerous bottlenecks appeared, and optimizations became a necessity. Databases could no longer keep up, and the project codebase became too rigid and slow. The volume of user-generated content also increased, creating new bottlenecks.
In normal-sized projects, traditional databases like MySQL and PostgreSQL can meet all your needs. In high-load projects, however, each need often requires a separate data storage solution. As the load increased, it became crucial to switch to custom, highly specialized databases that store data in simple, fast, low-level structures. In 2009, when relational databases could no longer handle the growing load effectively, the team started developing its own data storage engines. These engines function as microservices with embedded databases written in C or C++. Currently, there are about eight engines running as giant clusters, each responsible for its own logic: messages, recommendations, photos, hints, letters, lists, logs, news, and so on. For each task that needs a specific data structure or unusual queries, the team creates a new engine. The custom engines proved to be much more efficient because they use simple data structures; in some cases, they store data as nearly bare indexes, which means minimal structuring and processing at the reading stage. This approach increases data access and processing speed.
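To illustrate what "nearly bare indexes" in simple, low-level structures can look like, here is a minimal sketch of my own (the real engines are written in C or C++ and are far more sophisticated): a fixed-width binary record file that can be read by offset with almost no parsing.

```php
<?php
// Minimal, illustrative sketch (not VK's engine code): a fixed-width binary
// "index" of (user_id, post_id, timestamp) records. Because every record has
// the same size, the N-th record can be read directly by offset, with no
// parsing beyond unpacking three integers.

declare(strict_types=1);

const RECORD_SIZE = 12; // 3 x 32-bit unsigned integers

function appendRecord(string $file, int $userId, int $postId, int $ts): void
{
    // 'N' = unsigned 32-bit big-endian integer
    file_put_contents($file, pack('NNN', $userId, $postId, $ts), FILE_APPEND);
}

function readRecord(string $file, int $index): array
{
    $handle = fopen($file, 'rb');
    fseek($handle, $index * RECORD_SIZE);
    $raw = fread($handle, RECORD_SIZE);
    fclose($handle);
    return unpack('NuserId/NpostId/Nts', $raw);
}

appendRecord('feed.idx', 42, 1001, time());
print_r(readRecord('feed.idx', 0)); // Array ( [userId] => 42 [postId] => 1001 [ts] => ... )
```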
We also get more efficient data replication and sharding: reliance on master-slave replication and strict data-level sharding enables horizontal scaling without any issues. All data is heavily cached and often precomputed in advance. Caches are shared, with custom wrappers for automatic key count calculation at the code level. In large systems like ours, caching shifts from being mainly about improving performance to being about reducing load on the backend.
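As a rough, hypothetical illustration of code-level sharding and cache-key handling (my own sketch, not VK's wrappers; the shard count and the pickShard/cacheKey helpers are invented for the example):

```php
<?php
// Illustrative sketch of code-level sharding: not VK's implementation.
// A record's shard is derived from its owner ID, so every request can be
// routed to the right database or cache host without a lookup service.

declare(strict_types=1);

const DB_SHARDS = 16; // assumed shard count for the example

function pickShard(int $userId): int
{
    return $userId % DB_SHARDS;
}

function cacheKey(string $entity, int $userId): string
{
    // Keeping the shard in the key makes it easy to group keys per host.
    return sprintf('%s:%d:%d', $entity, pickShard($userId), $userId);
}

$userId = 123456;
echo pickShard($userId), "\n";          // 0
echo cacheKey('feed', $userId), "\n";   // "feed:0:123456"
```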
The next challenge was optimizing the application code itself.
Transforming PHP code into C++ can make it 7-10 times faster without rewriting the entire codebase.
The application code itself was written in PHP and became too slow. However, changing the language was impossible due to millions of lines of code in the project. The goal of KPHP is to transform PHP code into C++. This approach boosts performance without the extensive problems associated with rewriting the entire code base. The team started improving the system from the bottleneck, which for them was the language, not the code itself. Finally, in real production environments, KPHP proved to be from 7 to 10 times faster than simple PHP.
KPHP was adopted as the backend of VK, and by now it supports PHP 7 and 8 features, making it compatible with modern PHP standards. KPHP allows fast compilation and efficient development cycles. It enforces strict typing to reduce bugs and improve code quality, and it uses shared memory for efficient memory management. The system supports parallelization and coroutines, allowing multiple operations to run simultaneously, and it also improves performance on systems with non-uniform memory access (NUMA).
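As a simple illustration (my own example, not taken from the talk), the kind of PHP that benefits most from this approach is explicitly typed code, which lets an ahead-of-time compiler such as KPHP map values to concrete C++ types instead of generic dynamic ones:

```php
<?php
// Illustrative example of strictly typed PHP of the kind an ahead-of-time
// compiler can turn into efficient C++. Explicit parameter and return types
// allow concrete C++ types to be chosen instead of dynamic values.

declare(strict_types=1);

/** @param int[] $postViews */
function averageViews(array $postViews): float
{
    if ($postViews === []) {
        return 0.0;
    }
    $sum = 0;
    foreach ($postViews as $views) {
        $sum += $views;
    }
    return $sum / count($postViews);
}

echo averageViews([120, 450, 30]), "\n"; // 200
```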
With the NoVerify linter, the team enhanced code quality and reliability. NoVerify is designed for large codebases and focuses on analyzing diffs before they are pushed. It indexes around 1 million lines of code per second and analyzes about 100,000 lines per second. Additionally, it can run on standard PHP projects. The application was also partly transitioned to a microservices architecture to accelerate time to market. This shift allowed the development of services in various programming languages and provided greater flexibility in the development process. By breaking down the system into smaller independent services, it became possible to deploy updates more rapidly and ensure that each component can be optimized with the most suitable technology.
After optimizing the database and code, the team began breaking the project into optimized microservices. The focus shifted to addressing the most significant bottlenecks in content storage and delivery. Images emerged as a critical bottleneck in the social network. The problem was that the same image needed to be displayed in multiple sizes due to interface requirements and different platforms (mobile with retina, non-retina, web, and so on). Before the changes, resized images occupied a significant amount of storage space. The team eliminated preset sizes and instead implemented dynamic resizing. They introduced another microservice called Image Processor that generates required sizes on the fly. Additionally, they transitioned to serving images in WebP format, which was very cost-effective.
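For illustration only, here is a minimal sketch of dynamic resizing with WebP output using PHP's standard GD extension. It shows the general idea of generating sizes on the fly, not how VK's Image Processor is actually implemented; the file names and the quality setting are assumptions.

```php
<?php
// Minimal sketch of on-the-fly image resizing with WebP output via PHP's GD
// extension. Illustrative only; a production image service is a separate,
// far more optimized component.

declare(strict_types=1);

function resizeToWebp(string $sourcePath, int $targetWidth): string
{
    $image = imagecreatefromstring(file_get_contents($sourcePath));
    // Height -1 preserves the aspect ratio.
    $resized = imagescale($image, $targetWidth, -1, IMG_BICUBIC);

    $outputPath = sprintf('%s_%dpx.webp', pathinfo($sourcePath, PATHINFO_FILENAME), $targetWidth);
    imagewebp($resized, $outputPath, 80); // quality 80 chosen as an example trade-off

    imagedestroy($image);
    imagedestroy($resized);
    return $outputPath;
}

// Example: produce a 320px-wide WebP preview of an uploaded photo.
echo resizeToWebp('photo.jpg', 320), "\n"; // "photo_320px.webp"
```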
The optimizations led to significant improvements, as highlighted on the slide. It's always worth identifying and optimizing the biggest bottlenecks for better performance. While the choice of technologies is unique for each high-load company, many approaches overlap and demonstrate effectiveness across the board. For example, Netflix uses a combination of microservices and distributed architecture to deliver content efficiently. They implemented caching strategies using EVCache and have developed their own data storage solutions. As one of Russia's largest tech companies, Yandex uses a variety of in-house databases and caching solutions to manage its search engine and other services.
I cannot but mention ClickHouse here, a highly specialized database developed by Yandex to meet its specific needs. This solution proved to be so fast and efficient that it is now widely used by others. It is an open-source database management system that stores and processes data by columns rather than rows, and its high-performance query processing makes it ideal for handling large volumes of data and real-time analytics. LinkedIn runs a distributed storage system called Espresso for its real-time data needs and leverages caching alongside Apache Kafka to manage high-throughput messaging. Twitter employs a custom-built storage solution called Manhattan, designed to handle large volumes of tweets and user data.
Wrapping up, let's quickly recap what we discussed today. First of all, high-load systems are applications built to support a large number of users and transactions simultaneously, requiring excellent performance and reliability. Second, the challenges of high-load systems include scalability, reliability issues, performance slowdowns, and complicated integrations. High-load systems have specific requirements, such as preventing data loss, allowing fast feature updates, and keeping downtime to a minimum. Using external solutions can become risky in high-load environments, so there is often a need for custom solutions. When beginning optimizations, you need to identify the key bottlenecks and then find ways to address them.
Among the technologies, high-load systems usually rely on effective scalable data storage with good caching, compiled languages, distributed architecture, and good tooling. Still, there are no fixed rules for creating high-load applications; it's always an experimental process. Thank you for your time. If you have any questions on the topic, I will be glad to answer them. Feel free to message me on Telegram. Bye.