Design and Implementation of an Efficient I/O Method for a Real-time User Level Thread Library

Kota ABE, Toshio MATSUURA (Media Center, Osaka City University, Osaka 558-8585, Japan)
Keiichi YASUMOTO (Faculty of Economics, Shiga University, Shiga 522-8522, Japan)
Teruo HIGASHINO (Graduate School of Engineering Science, Osaka University, Osaka 560-8531, Japan)
Abstract
We have developed RT-PTL, a portable user-level thread library on UNIX intended for soft real-time processing, such as multimedia systems. A user-level thread is known to be more efficient than a kernel-level one because of its small context switch overhead. However, user-level threads have the restriction that if one thread issues a system call that blocks the entire calling process (such as a disk I/O operation), no other thread can run until the system call completes. This characteristic is undesirable for real-time processing. In this article, we present a method to overcome this restriction. With this method, if a thread requests a system call such as a disk I/O operation, only that thread is blocked and the other threads can continue execution. We have implemented the method and confirmed through experiments that it is widely usable and efficient.
1. Introduction
Multimedia systems that handle video and audio streams require real-time behavior. Moreover, to handle multiple data streams concurrently, they must also be capable of efficient concurrent processing. For concurrent processing, multi-threaded programs are known to be more efficient than traditional programs consisting of multiple single-threaded processes because of their smaller context switch and interprocess communication overheads; this makes multi-threading well suited to real-time processing. Some real-time operating systems, such as RT-Mach, provide real-time threads. Unfortunately, these real-time operating systems are not as popular as UNIX, so such real-time threads cannot be widely used.
For this reason, we have developed RT-PTL, a real-time thread mechanism atop general UNIX. To obtain portability and efficiency, RT-PTL adopts a user-level thread mechanism, in which all thread control is done in user-level library code and no kernel modification is required. Because general UNIX does not guarantee real-time behavior, RT-PTL is designed for soft real-time processing and operates with a best-effort policy. This property is still useful in areas such as multimedia systems. RT-PTL is based on our previous work, PTL [1]. The main characteristics of PTL are (1) portability (CPU independence under BSD UNIX); (2) conformance to POSIX 1003.1c (Pthread) [2], the standard application programming interface; and (3) preemptive scheduling. PTL is stable and freely available, and we have used it for the runtime support of our LOTOS compiler. In addition to all functions of PTL, RT-PTL provides (1) EDF (Earliest Deadline First) scheduling and (2) a deadline miss detection mechanism. However, user-level thread mechanisms have the restriction that if a thread issues a system call which takes some time to complete, the whole process is blocked until the system call completes. Therefore, once a thread issues such a system call, no other thread can proceed while the system call is in progress. In particular, system calls for disk I/O operations take a long time to complete, because they may wait for physical I/O devices (or, on NFS, for the NFS server's response). Owing to this characteristic, user-level threads have only limited applicability even to soft real-time systems. In this paper, we propose an effective method to avoid blocking on file I/O operations under user-level thread mechanisms. We have implemented the method and integrated it into RT-PTL; we call the result PTL-N. PTL-N retains the same level of portability as RT-PTL.
Moreover, PTL-N is compatible with RT-PTL, so any program that runs under RT-PTL also runs under PTL-N without modification (transparency). Under PTL-N, I/O operations do not block the process. This is achieved by preparing a separate I/O server process dedicated to I/O operations. In general, such process separation introduces interprocess communication overhead; however, we have developed a method to reduce it. Through experiments, we have confirmed quantitatively that avoiding I/O blocking yields good results: process blocking is avoided, and nearly 90% of the I/O time is usable by other threads. Reducing the communication overhead makes I/O performance with PTL-N comparable to a normal method without PTL-N.
2. Basic idea and implementation issues

2.1. Related work
Many user-level thread mechanisms have been implemented on UNIX [3, 4, 7, 5, 8]. In existing user-level thread mechanisms, it is common to avoid I/O blocking using the non-blocking I/O mode provided by UNIX. Unfortunately, this technique is not effective for disk I/O requests. As far as we know, no existing user-level thread mechanism can avoid blocking caused by disk I/O.

2.2. Outline of I/O processing
In general UNIX systems, there is no way for a process to avoid blocking during a file I/O system call unless the kernel is modified. To cope with this problem, we propose a technique that handles file I/O in a separate process dedicated to file operations. Hereafter, we call this process the I/O server and the original process the main process. The main process communicates with the I/O server to perform file I/O. When the I/O server receives a request, it issues the corresponding file I/O system call; when the system call completes, it sends a signal to the main process to notify it of the completion (we call this signal the completion signal). Fig. 1 depicts an overview of the proposed mechanism.
With this mechanism, the following items must be addressed to achieve good I/O throughput.

Improve efficiency of I/O data transfer. In our approach (Fig. 1), data must be transferred between the main process and the I/O server, and this transfer can become an overhead. Although each file I/O usually involves several kbytes of data, several Mbytes may sometimes be transferred. It is therefore quite important to transfer I/O data between the processes efficiently.

Reduce overhead caused by signal interruption. The main process must be notified by a completion signal whenever a requested file I/O completes at the I/O server. Frequent signal interruption may introduce overhead, so signal interruptions should be reduced as far as possible.

3. Improving performance of PTL-N

3.1. Improving performance of data transfer
The fastest interprocess communication method under UNIX is the shared memory facility. With this method, however, I/O data must be copied between the shared memory segment and the data buffer in the main process, and the overhead of this copy cannot be ignored when the I/O data is large. To avoid this overhead, we have devised a new method that shares not only a shared memory segment but also a process's private data area between the two processes. This eliminates the data copy, since the I/O server can directly access the data area of the main process. However, because the address space of each process is separate in UNIX, sharing address space between processes is not straightforward. We found a way to do so using the mmap system call, which maps a specified file into a process's memory space; with it, the address spaces of the I/O server and the main process can be shared. For example, when the I/O server receives a READ request from the main process, it issues a read system call. The UNIX kernel puts the data into the memory space of the I/O server, and the data is automatically reflected in the memory area of the main process without any copy operation. We call this method copy optimization.
Figure 1. Outline of I/O processing in PTL-N

3.2. Reducing overheads of I/O completion notification
Usually, after a thread in the main process issues an I/O request, other threads run. To reactivate the thread that issued the request, the main process must be notified when the I/O completes. To deliver this notification asynchronously, we chose to have the I/O server interrupt the main process with a UNIX signal. However, since signal interruption is expensive in UNIX, performance may deteriorate if the main process is interrupted too often. Our implementation therefore reduces the number of signals issued by the I/O server. When the main process sends an I/O request, an interprocess context switch to the I/O server may occur. If the request is processed in a short time, the I/O has already completed by the time the main process is rescheduled (Fig. 2). In such a case, the signal interruption can be omitted, since the main process can notice the completion of the requested I/O immediately. We call this method signal optimization.

Figure 2. Control flow between processes

4. Evaluation
We have implemented PTL-N and evaluated it through several experiments. In each experiment, we varied the request size of each I/O call from 8 kbytes (the typical size used by the standard I/O library, stdio) to 256 kbytes. With PTL-N, while one thread issues an I/O call and waits for its completion, other threads can consume CPU time. To measure this usable CPU time, we ran two threads in parallel: one (with higher priority) issues I/O calls while the other (with lower priority) iterates indefinitely. The longer the usable CPU time, the more efficiently the I/O waiting time is exploited. We used a SPARCstation 20 (SunOS 4.1.4, 64 MB RAM) to run the applications and an IBM-PC compatible machine (Solaris 2.5.1, MMX Pentium 166 MHz, 64 MB RAM) as an NFS server; the two machines were connected via 10BASE-T Ethernet.

4.1. Portability and transparency
We have confirmed that PTL-N works correctly on SunOS 4.1.4, FreeBSD 2.2.6 and Solaris 2.5.1 (x86). This result shows that our address-space sharing technique is applicable to most UNIX systems. Since we integrated the technique into PTL-N without modifying the semantics of system calls, existing application programs need not be modified at all.

4.2. I/O performance
To evaluate the I/O performance of PTL-N, we measured the time needed to read or write a 10 Mbyte file in the following cases:
- with PTL-N or without PTL-N (a non-threaded program);
- with the file located either on a local file system (UFS) or on a network file system (NFS);
- with the I/O data either in a cache or not.
If the requested data is in a cache, the I/O system call completes quickly.

Writing to a file on NFS. An experimental result for writing a file on NFS (NFS-write) is shown in Fig. 3. The horizontal and vertical axes represent the data size of each I/O call and the elapsed time, respectively. "PTL-N" and "Normal UNIX" denote the I/O time taken by PTL-N and by a non-threaded program, respectively; "Usable CPU time" is as defined above.

Figure 3. I/O time (NFS write)

According to the figure, NFS-write time with PTL-N is comparable to that without PTL-N. Moreover, with PTL-N, at least 90% of the total I/O waiting time could be used by other threads. We have confirmed the same tendency in the following cases: (1) writing a file on UFS and (2) reading a file on NFS when its contents are not in a cache.

Reading from a file on UFS. We also measured reading a file on UFS (UFS-read). When the requested file is in a cache (Fig. 4), the read system call merely copies data from the cache to memory; in this case, the usable CPU time is nearly 0. Reading a file on NFS whose contents are in a cache shows the same tendency. When the requested file is not in a cache, reading takes some time due to the physical I/O operation; accordingly, in this case the waiting time can be used effectively by other threads, as with NFS-write. Since UFS-read takes much less time than NFS-write, the I/O time with PTL-N is 2 to 2.5 times longer than without PTL-N. Nevertheless, with PTL-N, most of the I/O waiting time can be used effectively by other threads.
Figure 4. I/O time (UFS read, cache hit)
4.3. The impact of copy and signal optimization
We investigated the effects of the copy optimization and the signal optimization by comparing PTL-N with two special versions of PTL-N: (1) PTL-N without the copy optimization (which simulates using a shared memory segment to transfer I/O data), and (2) PTL-N without the signal optimization (the I/O server sends a completion signal whenever an I/O completes). Fig. 5 depicts the results for UFS-read. The copy optimization is effective for all I/O sizes. The signal optimization is effective when the size of each I/O is small, since the number of issued signals increases as the I/O size shrinks.

Figure 5. Optimization effect of PTL-N
5. Conclusion
In this paper, we have proposed a new technique that allows user-level multi-thread mechanisms to avoid blocking of the whole process caused by file I/O requests, and we have implemented PTL-N based on it. The experiments show that while a thread issues a slow I/O system call (e.g., NFS-read/write and UFS-write) and waits for its completion, the other threads can continue execution.
Our future work is to extend PTL-N to multiprocessor environments. For this, the proposed technique can be used to share the same address space among multiple processes; the processes can then run in parallel on a multiprocessor, each serving as a virtual processor, and threads assigned to different processes can communicate via the shared address space. Using multiple UNIX processes as virtual processors has already been proposed in [5] and [8]. However, the former relies on a non-portable method to share address spaces, and the latter can share only part of the address space. Both problems can be solved by applying the technique we have introduced in PTL-N.
References
[1] K. Abe, T. Matsuura, and K. Taniguchi. An implementation of portable lightweight process mechanism under BSD UNIX. Trans. of Information Processing Society of Japan, 36(2):296–303, 1995.
[2] ISO/IEC. Information technology – Portable Operating System Interface (POSIX) – Part 1: System Application Program Interface (API) [C Language]. IEEE, 1996.
[3] R. Mechler. A portable real time threads environment. Master's thesis, Dept. of Computer Science, Univ. of British Columbia, 1997.
[4] F. Mueller. A library implementation of POSIX threads under UNIX. In Proc. of the Winter 1993 USENIX Technical Conf., pages 29–41, 1993.
[5] H. Oguma, A. Kaieda, H. Morimoto, M. Suzuki, and Y. Nakayama. A study of light-weight process library on SMP computers. Information Processing Society of Japan, SIG Notes, 97-PRO-16, 97(112):13–18, 1997.
[6] S. Oikawa and H. Tokuda. User-level real-time threads. In Proc. of the 11th IEEE Workshop on Real-Time Operating Systems and Software (RTOSS '94), pages 7–11, May 1994.
[7] C. Provenzano. Pthreads. http://www.mit.edu:8001/people/proven/pthreads.html.
[8] C. Sakamoto, T. Miyazaki, M. Kuwayama, K. Saisho, and A. Fukuda. Design and implementation of Parallel Pthread Library PPL with parallelism and portability. Trans. of the Institute of Electronics, Information and Communication Engineers, J80-D-I(1):42–49, 1997.
[9] H. Tatsumoto, K. Yasumoto, K. Abe, T. Higashino, T. Matsuura, and K. Taniguchi. Design and implementation of Timed LOTOS Compiler with real-time threads. In Proc. of Multimedia, Distributed, Cooperative and Mobile Symposium (DICOMO'98), pages 555–562, 1998.
[10] K. Yasumoto, T. Higashino, K. Abe, T. Matsuura, and K. Taniguchi. A LOTOS compiler generating multi-threaded object codes. In Proc. of the 8th IFIP International Conference on Formal Description Techniques (FORTE'95), Chapman & Hall, pages 271–286, Oct. 1995.