
How to improve the overall performance of block device IO under Linux

This article explains the three modes of the Linux IO scheduling layer: cfq, deadline, and noop, and offers tuning suggestions and suitable scenarios for each.

IO scheduling happens in the IO scheduler layer of the Linux kernel, and it targets Linux IO as a whole. Seen from a read() or write() system call, the Linux IO stack can be divided into seven layers:

VFS layer: the virtual file system layer. Because the kernel has to deal with many different file systems, each with its own data structures and methods, it abstracts this layer to adapt to the various file systems and expose a unified operation interface.

File system layer: each file system implements its own operations and provides its own specific features. There is not much to say here; read the code if you are interested.

Page caching layer: responsible for caching pages.

Generic block layer: since most IO targets block devices, Linux provides a block device abstraction layer similar to the VFS layer, presenting a unified block IO request interface upward on behalf of block devices with very different properties underneath.

IO scheduler layer: because most block devices are disks, scheduling algorithms and queues are designed around the characteristics of such devices and of different applications, so as to improve disk read/write efficiency in different environments. This is where the famous Linux elevator comes into play; the various scheduling policies for mechanical disks are implemented here.

Block device driver layer: the driver exposes a relatively high-level device operation interface upward, usually in C, while downward it deals with the device's own operation methods and protocol.

Block device layer: the concrete physical device, which defines the operation methods and specifications of the real hardware.

There is a well-organized [Linux IO Structure Diagram] that has become a classic; a picture is worth a thousand words:

What we are going to study today is mainly the IO scheduling layer.

The core problem it solves is how to improve the overall performance of block device IO, and the layer is designed mainly around the structure of mechanical hard disks.

As we all know, the storage medium of a mechanical hard disk is a platter, and the head seeks across tracks on the platter, much like a record player.

This structure yields high throughput for sequential access, but random access wastes a great deal of time on head movement, so each IO takes longer and overall IO responsiveness drops sharply.

The head's seek across the platter resembles elevator scheduling; indeed, Linux originally named its algorithm the Linux elevator. The idea is:

if, during a seek, we can "conveniently" service the requests for all the tracks we pass along the way, overall IO throughput improves with little impact on response time.

This is the rationale behind designing IO scheduling algorithms.

The kernel currently enables three algorithms/modes by default: noop, cfq, and deadline. Strictly speaking there are only two:

because the first, noop, is a no-op scheduler: it performs no scheduling, does not sort IO requests, and simply uses one FIFO queue with basic IO merging.

The kernel's default scheduler is currently cfq, the so-called Completely Fair Queueing scheduler. As the name implies, it tries to provide a completely fair IO environment for all processes.

Note: remember this term, cfq, Completely Fair Queueing; otherwise the rest of this article will not make sense.

cfq creates a synchronous IO scheduling queue for each process and, by default, allocates IO resources by time slice and by number of requests, so that each process's share of IO is fair. cfq also implements process-level priority scheduling, which we explain in detail below.

The IO scheduling algorithm can be viewed and changed through sysfs.
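For example, assuming the disk is sda (substitute your own device name), checking and switching the scheduler looks roughly like this:

```bash
# Show the schedulers available for sda; the active one appears in brackets
cat /sys/block/sda/queue/scheduler

# Switch sda to the deadline scheduler (requires root)
echo deadline > /sys/block/sda/queue/scheduler
```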

cfq is a good choice of IO scheduler for general-purpose servers, and it is also a good choice for desktops.

However, it is not well suited to many scenarios with heavy IO pressure, especially those where the pressure is concentrated in particular processes.

In such scenarios, we care more about the IO response time of those one or few processes than about letting all processes share IO fairly; database applications are a typical example.

Deadline scheduling is a better fit for that scenario. deadline implements four queues:

Two of them handle normal reads and writes respectively: requests are sorted by sector number and normal IO merging is done to improve throughput. Because IO requests may cluster around certain disk locations, new requests keep getting merged there, and requests for other disk locations could starve.

The other two queues handle reads and writes that are about to expire, sorted by request creation time. When a request exceeds its deadline it is placed on one of these queues, and the scheduler guarantees that expired requests are served first, so no request starves.

Not long ago the kernel shipped four algorithms by default; the fourth was called as (Anticipatory Scheduler). Such a lofty name almost makes you think the Linux kernel can tell fortunes.

In reality it is just deadline-style scheduling plus a short wait: after serving a request it idles briefly, and if an IO request that can be merged arrives in that window it is merged, improving deadline's data throughput for sequential read/write workloads.

In fact this is not anticipation at all; a better name would be the trust-to-luck scheduler. Of course, the strategy does pay off in certain specific scenarios.

But in most scenarios it not only fails to improve throughput, it also hurts response time, so the kernel simply removed it from the default configuration. Linux is, after all, about practicality, and we will not waste any more breath on this scheduler.

1. cfq: Completely Fair Queueing

cfq is the kernel's default IO scheduler, and it is a good choice for desktop applications and for the most common workloads.

How does it achieve a so-called completely fair queue?

First, fair to whom? From the operating system's point of view, the subject performing IO is the process, so fairness here is between processes: each process should get as equal a share of IO resources as possible.

How, then, do we let processes share IO resources fairly? We first need to understand what an IO resource is. IO is usually measured with two units: read/write bandwidth and read/write IOPS.

Bandwidth is the amount of data read or written per unit time, for example 100 MB/s; IOPS is the number of reads and writes per unit time. The two behave differently under different workloads, but it is certain that one of them will reach its limit first and become the IO bottleneck.

Given the structure of a mechanical hard disk, sequential reads and writes achieve high bandwidth at relatively low IOPS, because many IOs can be merged and read-ahead speeds up reads.

When the workload leans toward random reads and writes, IOPS rises, the chance of merging IO requests falls, the amount of data per IO shrinks, and bandwidth drops accordingly.

From this we can see that a process's IO resources show up mainly in two forms: the number of IO requests it submits per unit time and the bandwidth it occupies.

Either way, both are closely tied to how much IO processing time the process is allocated.

Some services occupy more bandwidth with fewer IOPS, while others occupy less bandwidth with more IOPS, so scheduling by the amount of IO time each process occupies is relatively fair.

In other words: I do not care whether your IOPS is high or your bandwidth is high; when your time is up, we switch to the next process, and what you do with your slice is up to you.

So cfq tries to allocate the same length of time slice to every process. Within its slice a process may submit its IO requests to the block device for processing; when the slice ends, its remaining requests wait in its own queue until it is scheduled again. This is the basic principle of cfq.

Of course, there is no absolute "fairness" in real life. In common scenarios we may want to manually assign an IO priority to a process, just as we assign CPU priorities.

Therefore, in addition to the fair queue scheduling of time slices, cfq also provides priority support. Each process can set an IO priority, and cfq will take this priority setting as an important reference factor for scheduling.

Priorities are first divided into three classes: RT, BE, and IDLE (real time, best effort, idle), and cfq handles the IO of each class with a different strategy. In addition, RT and BE each have eight sub-priorities for finer-grained QOS, while IDLE has only one.

In addition, as we all know, the kernel by default buffers reads and writes through the buffer/cache. In that case cfq cannot tell which process the request currently being handled came from.

Only when a process reads or writes in synchronous mode (sync read or sync write) or with direct IO can cfq tell which process an IO request came from.

So, besides the per-process IO queues, cfq also implements a shared queue to handle asynchronous requests.

The kernel now provides cgroup-based isolation of IO resources, so on top of the scheme above cfq also implements scheduling support for cgroups.

Overall, cfq uses a set of data structures to support all of this functionality; you can follow the implementation in the source file block/cfq-iosched.c.

1.1 The design principles of cfq

Here is a brief tour of the main data structures. cfq drives the whole scheduler through a structure named cfq_data. With cgroup support, cfq manages all processes divided into a number of control groups.

Each cgroup is described by cfq's cfq_group structure, and all cgroups are placed, as scheduling entities, on a red-black tree keyed by vdisktime.

vdisktime records the IO time the cgroup has consumed. Whenever a cgroup needs to be scheduled, the one with the smallest vdisktime is always picked from the red-black tree, which keeps IO usage "fair" across all cgroups.

We also know that cgroups can allocate blkio resources proportionally. This works because the vdisktime of a cgroup with a larger share grows slowly while that of a cgroup with a smaller share grows quickly, at rates proportional to the configured shares.

In this way different cgroups receive different proportions of IO, and from cfq's point of view this is still "fair".
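As an illustration only, here is a minimal sketch of proportional blkio allocation, assuming a cgroup v1 blkio hierarchy mounted at /sys/fs/cgroup/blkio; the group names grp_a/grp_b and the PID are placeholders:

```bash
# Create two control groups under the blkio controller
mkdir -p /sys/fs/cgroup/blkio/grp_a /sys/fs/cgroup/blkio/grp_b

# Give grp_a twice the IO weight of grp_b (weights typically range from 10 to 1000)
echo 800 > /sys/fs/cgroup/blkio/grp_a/blkio.weight
echo 400 > /sys/fs/cgroup/blkio/grp_b/blkio.weight

# Move a process into grp_a by writing its PID (1234 is a placeholder)
echo 1234 > /sys/fs/cgroup/blkio/grp_a/tasks
```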

After picking the cgroup (cfq_group) to serve, the scheduler must next choose a service_tree.

The service_tree data structure corresponds to a set of red-black trees whose main purpose is to separate requests by priority class, i.e. RT, BE, and IDLE. Each cfq_group maintains seven service trees, organized as follows:

Among them, service_tree_idle is the red-black tree used to queue IDLE-class requests.

The other six form a two-dimensional array: the first dimension holds one entry each for RT and BE, and each of those maintains three red-black trees corresponding to three request subtypes: SYNC, SYNC_NOIDLE, and ASYNC.

You can think of SYNC as SYNC_IDLE, the counterpart of SYNC_NOIDLE. Idling is a mechanism cfq adds to merge consecutive IO requests as much as possible and improve throughput; think of it as an "idle" waiting mechanism.

Idling means that when a queue finishes a request, the scheduler waits a short while before moving on; if the next request arrives in time, head seeks are avoided and sequential IO requests can keep being served.

To implement this, cfq keeps a sync queue in the service_tree structure. A synchronous sequential request is queued on this service tree; a synchronous random request is queued on the SYNC_NOIDLE tree, where cfq watches whether the following requests turn out to be sequential.

All asynchronous write requests are queued on the ASYNC service tree, and that queue has no idle-waiting mechanism.

cfq also makes special adjustments for devices such as SSDs: when it detects that the storage device has a deep command queue, for example an SSD, per-queue idling is disabled and all IO requests are queued on the SYNC_NOIDLE service tree.

Each service tree corresponds to a number of cfq_queue queues, and each cfq_queue corresponds to one process; this is explained in detail below.

cfq_group also maintains a set of asynchronous IO request queues shared by all processes in the cgroup. They are organized as follows:

Asynchronous requests are likewise divided into the three classes RT, BE, and IDLE, and each class is queued on its own cfq_queue.

RT and BE also support sub-priorities; each has IOPRIO_BE_NR of them, a constant defined as 8, so the array index runs from 0 to 7.

The kernel version analyzed here is Linux 4.4, and from cfq's point of view cgroup support for asynchronous IO is already in place. We need to define what "asynchronous IO" means here: it refers only to the IO requests that write data back from the in-memory buffer/cache to disk, not to aio (man 7 aio) or to Linux native asynchronous IO mechanisms such as libaio. Those so-called "asynchronous" IO mechanisms are in fact implemented synchronously inside the kernel (a von Neumann machine has, at bottom, no truly "asynchronous" mechanism).

As explained above, since a process normally writes data into the buffer/cache first, this kind of asynchronous IO is handled by the asynchronous request queues in cfq_group.

So why does the service_tree above also implement an ASYNC type?

This, of course, is to support per-process asynchronous IO and to prepare the ground for "complete fairness".

In fact, in the newest cgroup v2 blkio subsystem, the kernel already supports throttling of buffered IO, and these easily confused type labels are exactly the markers the new framework needs.

The new framework is more complex and more powerful, but don't worry: the official cgroup v2 interface will arrive with the release of Linux 4.5.

Let us return to choosing a service_tree. The three priority classes are selected strictly by class priority: RT is highest, BE next, and IDLE lowest. In other words, as long as RT requests exist they are always served; BE is served only when there is no RT, and IDLE only when there is neither.

Each service_tree corresponds to a red-black tree whose elements are cfq_queues, and each cfq_queue is a request queue the kernel creates for one process (thread).

Each cfq_queue maintains a variable rb_key, which is actually the IO service time of this queue.

Here the cfq_queue with the smallest service time is found through the red-black tree, which again guarantees "complete fairness".

Once a cfq_queue is selected, the IO requests in that queue are processed; the scheduling here is essentially the same as deadline.

cfq_queue places each incoming request on two structures: once on a FIFO list and once on a red-black tree keyed by the sector being accessed.

By default requests are taken from the red-black tree for dispatch; when a request's delay reaches the deadline, the longest-waiting request is taken from the FIFO instead, which guarantees that no request starves.

This is the whole cfq scheduling process. Of course, there are still many details that have not been explained, such as merge processing and sequential processing.

1.2 cfq parameter tuning

Understanding the whole scheduling flow helps us decide how to tune cfq's parameters. All of cfq's tunables live in /sys/class/block/sda/queue/iosched/; on your system, replace sda with the appropriate disk name. Let's see what is there:
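On a machine where sda uses cfq, the listing looks roughly like this (the exact set of files depends on the kernel version):

```bash
ls /sys/class/block/sda/queue/iosched/
# back_seek_max      fifo_expire_async  group_idle   quantum      slice_async_rq  slice_sync
# back_seek_penalty  fifo_expire_sync   low_latency  slice_async  slice_idle      target_latency
```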

Several of these parameters relate to the seek behavior of a mechanical disk's head. If the names are not self-explanatory, brush up on that background first:

back_seek_max: the maximum distance the head may seek backwards; the default is 16M.

back_seek_penalty: the penalty factor for backward seeks, expressed relative to forward seeks.

These two settings exist to avoid slow seeks caused by head thrashing. The basic idea is that when an IO request arrives, cfq estimates its seek cost from its target position.

back_seek_max sets the upper bound: as long as the backward seek distance does not exceed this value, cfq treats the request as if it were a forward seek.

back_seek_penalty is the cost factor: when a backward seek covers 1/2 (i.e. 1/back_seek_penalty) of the distance of a forward seek, cfq considers the two requests to have the same seek cost.
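To check the values in effect on a particular machine (sda assumed), read the two files directly:

```bash
# Maximum backward seek distance still treated like a forward seek (default discussed above)
cat /sys/class/block/sda/queue/iosched/back_seek_max

# Cost factor applied to backward seeks relative to forward seeks
cat /sys/class/block/sda/queue/iosched/back_seek_penalty
```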

These two parameters are really the conditions cfq uses to decide whether requests can be merged; while a request is being processed, all requests that meet these conditions are merged as far as possible.

fifo_expire_async: sets the timeout for asynchronous requests.

Synchronous and asynchronous requests are handled in different queues. During scheduling, cfq normally serves synchronous requests first and then asynchronous ones, unless an asynchronous request satisfies the merge conditions above.

When a process's queue is scheduled, cfq first checks whether any asynchronous request has timed out, i.e. exceeded the fifo_expire_async limit. If so, one expired request is dispatched first; the remaining requests are still served by priority and sector order.

fifo_expire_sync: the same as above, but for synchronous requests.

slice_idle: the idle wait time. It makes cfq pause for a while when switching cfq_queues or service trees, which improves throughput on mechanical disks.

IO requests from the same cfq_queue or service tree usually have good locality, so waiting reduces the number of head seeks. On mechanical disks this value defaults to non-zero.

On SSDs or hardware RAID devices, however, a non-zero value only hurts efficiency, because an SSD has no head to seek; on such devices it should be set to 0 to turn the feature off.

group_idle: similar to the previous parameter, except that cfq waits for a while when switching between cfq_groups.

In a cgroup scenario, if idling were done the slice_idle way, every process in the cgroup could trigger an idle wait whenever its cfq_queue is switched.

Then, if such a process always has requests pending, other processes in the same group may not get scheduled before the cgroup's quota is used up; they effectively starve, and IO becomes a bottleneck.

In that case we can set slice_idle = 0 and group_idle = 8, so that the idle waiting is done at the cgroup level rather than per cfq_queue, avoiding the problem above.
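Concretely, assuming the device is sda, that adjustment is just two writes:

```bash
# Disable per-queue idling and idle at the cgroup level instead
echo 0 > /sys/class/block/sda/queue/iosched/slice_idle
echo 8 > /sys/class/block/sda/queue/iosched/group_idle
```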

low_latency: this switch turns cfq's low-latency mode on or off.

When it is on, cfq recalculates each process's slice time based on the target_latency setting.

This favors fairness of throughput (by default, fairness means fair time-slice allocation).

Turning the switch off (setting it to 0) makes cfq ignore target_latency, and processes then share IO strictly by time slice. The switch is on by default.

We already know that cfq's design includes the notion of "idling": by merging as many consecutive reads and writes as possible, head seeks are reduced and throughput goes up.

If one process keeps issuing fast sequential reads and writes, it will hit cfq's idle wait very often and slow down every other process that needs IO. If those other processes do not issue much sequential IO themselves, IO throughput becomes very uneven across the system.

For example, suppose many dirty pages in the page cache are being written back just as a desktop user opens a browser. The background write-back is likely to hit a lot of idle waits, so the browser's small number of IOs keep waiting, and the user feels the browser is sluggish.

low_latency exists mainly to optimize this situation. When it is on, the system uses the target_latency setting to rein in processes that grab a large share of IO throughput by hitting the idle wait, so different processes end up with a relatively balanced share of IO throughput. The switch is well suited to desktop-like scenarios.

target_latency: when low_latency is on, cfq recalculates the IO time slice allocated to each process based on this value.
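For example, to enable low-latency mode and set a latency target (sda assumed; the value 300 is illustrative and, on typical kernels, interpreted in milliseconds):

```bash
# Turn on low-latency mode, then set the latency target cfq uses to recompute slices
echo 1 > /sys/class/block/sda/queue/iosched/low_latency
echo 300 > /sys/class/block/sda/queue/iosched/target_latency
```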

quantum: how many IO requests are taken from a cfq_queue and dispatched in one go. Within one queue-processing cycle, requests beyond this number are not dispatched. The parameter only applies to synchronous requests.

slice_sync: when a cfq_queue is scheduled, the total processing time it may be given is computed with this value as a parameter, using the formula time_slice = slice_sync + (slice_sync / 5 * (4 - prio)). The parameter applies to synchronous requests.
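A quick sanity check of that formula, assuming slice_sync is 100 ms and prio is the sub-priority (0 highest, 7 lowest):

```bash
# time_slice = slice_sync + (slice_sync / 5 * (4 - prio))
slice_sync=100
for prio in 0 4 7; do
    echo "prio=$prio -> time_slice=$(( slice_sync + slice_sync / 5 * (4 - prio) )) ms"
done
# prio=0 gives 180 ms, prio=4 gives 100 ms, prio=7 gives 40 ms
```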

slice_async: the counterpart of slice_sync for asynchronous requests.

slice_async_rq: limits the maximum number of asynchronous requests a queue may dispatch within one time slice; the actual maximum also depends on the IO priority of the processes involved.

1.3 The IOPS mode of cfq

We already know that, by default, cfq uses priority-aware time slices to keep IO resource usage fair.

The process with high priority will get a longer time slice length, while the process with low priority will get a shorter time slice.

When the storage is a fast device that supports NCQ (Native Command Queuing), the best way to use its queue depth is to feed it multiple requests drawn from multiple cfq queues at once.

At that point, allocating resources by time slice is no longer appropriate, because with time slices only one request queue is served at any moment.

So we need to switch cfq into IOPS mode. Switching is simple: set slice_idle = 0. The kernel automatically detects whether the storage device supports NCQ, and if it does, cfq switches to IOPS mode.

Also, in the default priority-based time-slice mode, we can use the ionice command to adjust a process's IO priority. The default IO priority a process gets is derived from its nice value; the formula can be found in man ionice, so I will not repeat it here.
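For instance (the command and PID are placeholders):

```bash
# Run a backup in the best-effort (BE) class at sub-priority 7, the lowest within BE
ionice -c 2 -n 7 tar czf /tmp/logs.tar.gz /var/log

# Move an already-running process (PID 1234) into the idle class
ionice -c 3 -p 1234

# Show the current IO class and priority of a process
ionice -p 1234
```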

2. deadline: deadline scheduling

The deadline scheduling algorithm is much simpler than cfq. Its design goal is:

while serving requests in device sector order as far as possible, make sure no request starves: every request must be dispatched before its deadline.

As we know, the head may access the disk sequentially or randomly. Because of seek latency, IO throughput is high for sequential access and low for random access.

To optimize the throughput of a mechanical disk, we can have the scheduler sort incoming IO requests by sector as far as possible and dispatch them to the disk in that order, which raises IO throughput.

But doing only that creates a problem: suppose a request arrives whose target track is far from where the head currently sits, while a large number of other requests keep arriving near the current track.

Then those nearby requests keep getting merged and served, while the request for the distant track is never scheduled and starves.

deadline is exactly such a scheduler: it keeps IO throughput as high as possible while guaranteeing that distant requests are still scheduled within a bounded time and never starve.
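To close, a small sketch of putting a disk under deadline and adjusting its expiry times; sda is assumed, the parameter names are those the deadline scheduler exposes on typical kernels, and the values are only illustrative:

```bash
# Switch sda to deadline and look at its tunables
echo deadline > /sys/block/sda/queue/scheduler
ls /sys/block/sda/queue/iosched/
# typically: fifo_batch  front_merges  read_expire  write_expire  writes_starved

# Tighten the read deadline to 250 ms (reads normally expire sooner than writes)
echo 250 > /sys/block/sda/queue/iosched/read_expire
```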