Sunday, 8 September 2013

Scalability Strategies

I want to highlight some strategies that help build scalable applications without much pain. I will try to shed some light on data decomposition, the single writer principle and false sharing. These are principles that affect application performance from top to bottom.

Data decomposition

Data decomposition is a top-down technique for breaking work into small tasks that can be parallelized. It lets the data be split into chunks which can be processed in parallel. The strategy is widely applied in database systems; for example, both Cassandra and MongoDB use it to scale out. Both databases have a partitioning strategy which calculates a hash value for a key and then uses that value to locate a node in the cluster.
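As a rough illustration of the idea (a minimal sketch with hypothetical node names, not Cassandra's or MongoDB's actual partitioner), the key's hash can be mapped onto one of the cluster nodes, so that independent chunks of data end up owned, and processed, by different nodes:

import java.util.Arrays;
import java.util.List;

// A minimal sketch of hash-based partitioning: the key's hash decides
// which node owns the data, so independent chunks can be handled in parallel.
public class HashPartitioner {

    private final List<String> nodes; // hypothetical node addresses

    public HashPartitioner(List<String> nodes) {
        this.nodes = nodes;
    }

    public String nodeFor(String key) {
        int hash = key.hashCode() & 0x7fffffff; // mask the sign bit to avoid negative indices
        return nodes.get(hash % nodes.size());
    }

    public static void main(String[] args) {
        HashPartitioner partitioner =
                new HashPartitioner(Arrays.asList("node-a", "node-b", "node-c"));
        // The same key always routes to the same node.
        System.out.println(partitioner.nodeFor("user-42"));
    }
}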
There are also real-life examples showing that this strategy works very well, and that not applying it causes a lot of trouble.
For example, I use Tottenham Court Rd tube station for my daily commute and get to observe how painful life can be when decomposition is ignored. The station has notoriously bad access for Northern line users: they have to walk onto the Central line platform and then cross to the opposite side to reach the Northern line. On top of that, passenger traffic on the platform flows in both directions, so during peak hours hundreds of people bump into each other even though they are heading for different routes.
Blocking platform entries

Single Writer

This strategy states that only a single writer should change the state of a frequently used resource, which helps to eliminate the memory contention seen on today's multi-core processors. Synchronizing data among different cores carries a big performance penalty because of bus traffic, cache coherency protocols, locking costs and cache misses. With a single writer, a CPU can exploit modern hardware optimizations: knowing that there are no other writers, it doesn't need to force a sync with RAM and can make full use of its L1 cache. Moreover, with a single writer there is no lock acquire/release step, which would be needed with multiple writers.
Synchronization
A single picture can explain the single writer principle better than hundreds of lines of text. Let's assume there are four relay runners who never get tired, and that there is no rule forcing the baton to be exchanged at the exchange point. Instead of having multiple runners, a single runner who never tires can carry the baton alone, and in that case the baton never needs to be handed over at all. We can draw an analogy between multiple runners and multiple threads: in both cases every hand-over adds synchronization overhead. Similarly, we can have only one thread modifying a shared resource, which eliminates the synchronization work.
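As a minimal sketch of the principle (hypothetical class names, not taken from any particular library), a single dedicated thread drains work from a queue and is the only thread allowed to mutate the state, while other threads just submit work or read the published value:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Single-writer sketch: many threads may submit values, but only the
// writer thread ever mutates 'total', so no lock is needed around the
// mutation itself.
public class SingleWriterAccumulator {

    private final BlockingQueue<Long> inbox = new ArrayBlockingQueue<>(1024);
    private volatile long total; // written by one thread, readable by many

    public void start() {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    total += inbox.take(); // the only place 'total' is modified
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "single-writer");
        writer.setDaemon(true);
        writer.start();
    }

    public void submit(long value) throws InterruptedException {
        inbox.put(value); // producers hand work over instead of writing directly
    }

    public long total() {
        return total; // readers see the latest published value via the volatile field
    }
}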

False Sharing

False sharing is notoriously known as the silent performance killer. It arises in parallel applications when two threads operate on independent data that happens to sit in the same memory region and therefore on the same cache line. The cache coherency protocol may invalidate the whole cache line on each write, leading to memory stalls and extra traffic on the bus or interconnect.
Detecting the issue is quite a daunting task: it involves inspecting the memory layout of the classes and analysing CPU cache line behaviour to confirm that the application really does suffer from false sharing.
Let’s assume we have a message-id counter which is incremented every time it is accessed, by a single writer thread. Following the single writer principle, one can expect some performance improvement for message-id generation, as there are no lock release/acquire steps of the kind that would be needed with multiple writer threads.
However, modern processors have multiple cores, each with caches organized into cache lines that generally hold 64 bytes, so the JVM may lay out several variables on the same cache line regardless of how frequently each is accessed. Each time a cache line is updated, the copies held by other cores have to be invalidated and re-fetched before those cores can use the data again.
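A minimal sketch of how the problem can arise (hypothetical class, assuming 64-byte cache lines): two counters owned by two different threads sit next to each other in the same object, so they very likely land on the same cache line, and every write by one thread invalidates the line in the other thread's core even though the data is logically independent:

// Two independent counters, each written by its own thread. Because the
// fields are adjacent they typically share a 64-byte cache line, so the
// threads keep invalidating each other's copy: false sharing.
public class AdjacentCounters {

    volatile long messageId;     // written only by thread A
    volatile long requestCount;  // written only by thread B

    void runDemo() throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < 10_000_000; i++) messageId++; });
        Thread b = new Thread(() -> { for (int i = 0; i < 10_000_000; i++) requestCount++; });
        a.start(); b.start();
        a.join();  b.join();
    }
}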
Unfortunately, up to JDK 7 there isn't any native way to stop a variable from sharing a cache line with other variables; JDK 8 addresses this with the @Contended annotation.
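A minimal sketch of the JDK 8 approach (note that for application classes the JVM must be started with -XX:-RestrictContended for the annotation to be honoured):

import sun.misc.Contended; // JDK 8 location of the annotation

public class ContendedCounters {

    @Contended
    volatile long messageId;     // the JVM pads this field onto its own cache line

    @Contended
    volatile long requestCount;  // no longer shares a cache line with messageId
}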
However, there are still ways to ensure that a variable does not share a cache line with other variables. One of them is to make the variable's container big enough to occupy a whole cache line, so that no other variable can fit into it. That is generally achieved by creating an array spanning at least 64 bytes, i.e. eight longs in Java (8 × 8 bytes), and storing the shared variable in one of the array slots while the remaining slots act as padding.
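Here is a minimal sketch of that padding trick (hypothetical class; a fifteen-slot variant of the eight-long array described above, so that with the hot value in the middle slot there is roughly a full 64-byte cache line of padding on each side of it):

// Pre-JDK 8 style padding: the hot counter lives in the middle of a long
// array, so the surrounding slots act as padding and no other variable
// can end up on the same cache line.
public class PaddedMessageId {

    private static final int MIDDLE = 7;
    private final long[] padded = new long[15]; // 15 * 8 bytes, padding on both sides of the value

    public long next() {
        return ++padded[MIDDLE]; // incremented by the single writer thread only
    }
}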
Unordered items
Cache line alignment
Here is a fruit basket containing a mixture of fruits. Imagine how much harder it is to load or unload just the apples from it than from a rack that keeps each fruit in its own section.
That mixed basket is what variables sharing the same cache line look like. The more control we have over how data is decomposed and laid out, the easier it is to improve application performance.