Abstract: Multi-core processor platforms are now widely used even in embedded devices. Debugging multi-threaded embedded software is harder than debugging on ordinary desktop platforms because of the limitations of embedded platforms: their resources are sufficient to run only a pre-defined set of applications, not to debug them. Most known debugging solutions for parallel applications are intended for desktops or high-performance computers rather than embedded systems. Moreover, most debugging solutions give no information about system-wide application behavior. Thread Visualizer is intended to solve these problems and help developers debug their multi-threaded embedded applications. The tool was developed at the Samsung Research Center in Moscow and the Samsung Advanced Institute of Technology. Thread Visualizer supports ARM-based platforms running Linux.
Key words: Debug, multi-thread, embedded, stack, unwinding.
1. Introduction
Multi-core processor platforms are now widely used even in embedded devices, and the complexity of software for these platforms is rising dramatically. Software is becoming more complicated and multi-threaded: some modern embedded applications create hundreds of threads that run simultaneously. Debugging such applications becomes harder because different threads run on different cores, share resources, and encounter synchronization problems, race conditions, etc. Another problem is the limitations of embedded platforms. Their resources are sufficient to run only a pre-defined set of applications, not to debug them: only about 1-5% of CPU resources and several megabytes of RAM are available.
Most known debugging solutions for parallel applications are intended for desktops or high-performance computers rather than embedded systems. In addition, most of them give no information about system-wide application behavior. The Thread Visualizer tool is intended to solve these problems and help developers debug their multi-threaded embedded applications. It was developed at the Samsung Research Center in Moscow and the Samsung Advanced Institute of Technology. Thread Visualizer supports ARM-based platforms running Linux.
Thread Visualizer visualizes the hierarchy between the main process and its threads and the synchronization dependencies, and it provides unique thread identification that includes the full backtrace from the thread creation call, along with other useful features. This considerably simplifies debugging of complex multi-threaded applications on embedded systems.
2. Thread Visualizer
Thread Visualizer is a tool for debugging multi-threaded embedded applications. It supports ARM-based platforms running Linux. It has a target-host architecture, which makes it possible to overcome the resource limitations of embedded platforms. A lightweight target part collects data describing application behavior and sends it to the host over a network connection. All heavyweight operations, such as data storage, analysis and visualization, run on the host. To collect data, Thread Visualizer uses the System-Wide Analyzer of Performance (SWAP) engine [1].
SWAP is a profiler and performance analyzer for embedded applications, also developed at the Samsung Research Center in Moscow and the Samsung Advanced Institute of Technology. It is based on the kprobes technique [2] and provides dynamic instrumentation of kernel and user-space functions. SWAP does not require modification of the application’s source code or recompilation.
Using the SWAP engine, Thread Visualizer instruments the necessary functions and collects data on them, such as the function name, the Process IDentifier (PID)/Thread IDentifier (TID), the number of the CPU on which the function was executed, the time stamp of the function call, the function arguments, etc. The collected data and the executed binary files are then processed, and the final visualization is produced.
Thread Visualizer visualizes the hierarchy between the main process and its threads and the synchronization dependencies, and it provides unique thread identification, source code mapping, a timing view, statistics and other features.
With Thread Visualizer, a developer can examine the system-wide behavior of an application rather than only perform a number of specific operations on parallel threads, as conventional debuggers allow. The developer can see the main process, the threads and the synchronization objects of the application and the relations between them, such as parent-child relations between processes and threads and synchronization dependencies. Unique thread identification combines the numerical identifier generally used in Linux with the thread function name and the full backtrace from the thread creation point. In this way Thread Visualizer shows where and how every thread was created, including source code mapping.
Additionally, the timing view visualizes a time line with the execution segments of instrumented functions for every thread. Statistics on calls of instrumented functions are also provided for every thread.
Some modern embedded applications create hundreds of threads and synchronization objects. Thread Visualizer is extremely useful for analysis of such applications.
Thread Visualizer’s thread hierarchy, synchronization dependencies, thread identifying and source code mapping visualization are shown in Fig. 1.
Timing view visualization is shown in Fig. 2.
A detailed description of Thread Visualizer’s thread identification feature, the development barriers caused by stack unwinding limitations, and the proposed solution are given below.
3. Thread Identification
3.1 Standard Linux Thread Identification
In Linux, every thread has a unique numerical identifier: the Process IDentifier (PID)/Thread IDentifier (TID). Such identification is not informative, however, because it gives no clue about the source code that a particular thread executes.
3.2 Thread Identification in Intel Thread Profiler
To make thread identification more transparent, Intel Thread Profiler [3] includes in the thread identifier, together with the PID/TID, the name of the function that is started when the thread is created (the thread function name). For example, the pthread_create POSIX API [4] accepts the address of the thread function as its third argument. An example of thread creation source code is shown in Fig. 3.
However, an application can create many threads with the same thread function. In that case, the thread identifier does not contain enough information to identify a thread. The next section considers a solution to this problem.
3.3 Unique Thread Identification in Thread Visualizer
To identify threads uniquely, Thread Visualizer includes in the thread identifier, together with the PID/TID and the thread function name, the full call backtrace from the thread creation point. The C code example in Fig. 4 illustrates what a call backtrace is: in that example, the full call backtrace from the thread creation point is the chain of symbolic function names func2, func1 and main. Including the full call backtrace from the thread creation point in the thread identifier gives full and unique information about every created thread.
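As a rough illustration of the identifier described above, the following C sketch stores the PID/TID, the thread function name and the creation backtrace, and compares two identifiers by where their threads were created. The type `thread_id`, the function `same_origin`, the depth limit and the names used in the usage note are hypothetical and are not taken from Thread Visualizer.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 16  /* assumed upper bound on backtrace depth */

/* Hypothetical thread identifier: numeric IDs plus creation context. */
typedef struct {
    int pid;
    int tid;
    const char *thread_func;           /* name of the thread function */
    const char *backtrace[MAX_DEPTH];  /* innermost caller first */
    size_t depth;                      /* number of valid backtrace entries */
} thread_id;

/* Returns 1 if both threads run the same thread function AND were
 * created through the same chain of calls, 0 otherwise. */
int same_origin(const thread_id *a, const thread_id *b)
{
    if (strcmp(a->thread_func, b->thread_func) != 0 || a->depth != b->depth)
        return 0;
    for (size_t i = 0; i < a->depth; i++)
        if (strcmp(a->backtrace[i], b->backtrace[i]) != 0)
            return 0;
    return 1;
}
```

For instance, two threads that both run a function named `worker` compare as different when one was created through func2, func1, main and the other through a shorter chain, even though their thread function names match.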
Thread Visualizer uses the SWAP engine to collect stack snapshots and register values. It then unwinds the stack snapshots to restore the full call backtraces from the thread creation points.
Known stack unwinding methods, their limitations and Thread Visualizer’s method are described below.
4. Stack Unwinding Methods
Well-known stack unwinding methods rely either on a frame pointer register [5] or on debug information in the binary file that describes the stack frame layout. Both have limitations, which are described below. To overcome them, Thread Visualizer proposes a new method, also described below.
4.1 Method Based on the Frame Pointer Register
Consider first the method based on the frame pointer register. It is the simplest and best-known method of call stack unwinding. To use it, the application should be built with the gcc/g++ compiler using the -fno-omit-frame-pointer option. Without that option, gcc typically omits the frame pointer at any optimization level (options -O ... -O3).
During execution, the values of the frame pointer register are stored in the stack frames. In typical ARM code, the procedure prologue stores the frame pointer and the return address on the stack (for example, push {fp, lr}) and the epilogue restores them (for example, pop {fp, pc}).
Here, fp is the frame pointer register, lr is the link register (it holds the return address of the called function), and pc is the program counter register, into which the return address of the called function is restored. The list of consecutive return addresses is a backtrace.
Fig. 5 shows an example of the stack of the executing process for the code in Fig. 4 at the thread creation point, with the stored frame pointer values.
Fig. 6 shows an example of stack unwinding code for a stack with stored frame pointer values.
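A minimal C sketch of frame-pointer-based unwinding over a stack snapshot is given here for illustration. It assumes a frame layout commonly produced by gcc for ARM, in which fp points at the saved return address and the caller’s fp is stored one word below it. The layout, the `stack_snapshot` type and the function names are assumptions of this sketch, not details taken from Thread Visualizer.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy of a stack region: words[i] holds the value at address base + 4*i. */
typedef struct {
    uint32_t base;
    const uint32_t *words;
    size_t nwords;
} stack_snapshot;

/* Reads one aligned word from the snapshot; returns 0 if out of range. */
static int read_word(const stack_snapshot *s, uint32_t addr, uint32_t *out)
{
    if (addr < s->base || (addr - s->base) % 4u != 0)
        return 0;
    size_t idx = (addr - s->base) / 4u;
    if (idx >= s->nwords)
        return 0;
    *out = s->words[idx];
    return 1;
}

/* Follows the chain of saved frame pointers and collects return addresses.
 * Assumed layout: [fp] = saved lr, [fp - 4] = caller's fp. */
size_t unwind_fp(const stack_snapshot *s, uint32_t fp,
                 uint32_t *trace, size_t max)
{
    size_t n = 0;
    while (n < max) {
        uint32_t ret, prev_fp;
        if (!read_word(s, fp, &ret) || !read_word(s, fp - 4u, &prev_fp))
            break;
        trace[n++] = ret;
        if (prev_fp <= fp)  /* caller frames must lie at higher addresses */
            break;
        fp = prev_fp;
    }
    return n;
}
```

The check that each saved frame pointer is higher than the current one stops the walk on a zero or corrupted value instead of looping.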
The main limitation of this method is that it stores extra register values (the frame pointer), which decreases application performance and makes the method impractical on embedded platforms. Other drawbacks are that in some cases the application cannot be recompiled, and that some components of the application (e.g., libraries) may be built without a frame pointer.
4.2 Method Based on Binary File Debug Information on Stack Frame Layout
Now consider the method based on debug information in the binary file that describes the stack frame layout. To use this approach with Linux and the gcc/g++ compilers, the application should be built with the -g option. The binary file then contains the .debug_frame DWARF debug section [6], which holds information on the stack frame layout.
In this case, storing frame pointer values is not necessary. Every stack frame is described by the Canonical Frame Address (CFA), which is the start address of the stack frame; by the base register, from which the CFA offset is calculated (the stack pointer in most cases; the frame pointer, instruction pointer or others in rare cases); and by the offsets from the CFA of the stored register values (link register, frame pointer, instruction pointer, etc.).
Fig. 7 shows an example of the stack of the executing process for the code in Fig. 4 at the thread creation point, together with the DWARF information essential for stack unwinding.
The DWARF standard also defines the procedure for unwinding the stack from this frame layout information.
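The CFA-based walk can be sketched in C as follows. This is a deliberately simplified model: it assumes one fixed rule per procedure with the stack pointer as the base register, whereas real DWARF call frame information can change from instruction to instruction and must be decoded from .debug_frame. The `frame_rule` and `stack_snapshot` types and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified per-procedure unwind rule (assumption: constant over the body). */
typedef struct {
    uint32_t start, end;  /* code range [start, end) */
    uint32_t cfa_offset;  /* CFA = SP + cfa_offset */
    int32_t  ra_offset;   /* return address is stored at CFA + ra_offset */
} frame_rule;

/* Copy of a stack region: words[i] holds the value at address base + 4*i. */
typedef struct {
    uint32_t base;
    const uint32_t *words;
    size_t nwords;
} stack_snapshot;

static int read_word(const stack_snapshot *s, uint32_t addr, uint32_t *out)
{
    if (addr < s->base || (addr - s->base) % 4u != 0)
        return 0;
    size_t idx = (addr - s->base) / 4u;
    if (idx >= s->nwords)
        return 0;
    *out = s->words[idx];
    return 1;
}

static const frame_rule *find_rule(const frame_rule *rules, size_t nrules,
                                   uint32_t pc)
{
    for (size_t i = 0; i < nrules; i++)
        if (pc >= rules[i].start && pc < rules[i].end)
            return &rules[i];
    return NULL;
}

/* Starting from (pc, sp), computes each frame's CFA, reads the return
 * address relative to it, then continues in the caller with SP = CFA. */
size_t unwind_cfa(const stack_snapshot *s, const frame_rule *rules,
                  size_t nrules, uint32_t pc, uint32_t sp,
                  uint32_t *trace, size_t max)
{
    size_t n = 0;
    while (n < max) {
        const frame_rule *r = find_rule(rules, nrules, pc);
        uint32_t cfa, ra;
        if (r == NULL)
            break;  /* no layout information for this address */
        cfa = sp + r->cfa_offset;
        if (!read_word(s, cfa + (uint32_t)r->ra_offset, &ra))
            break;
        trace[n++] = ra;
        pc = ra;
        sp = cfa;
    }
    return n;
}
```

No frame pointer chain is needed here: the frame boundaries come entirely from the rule table, which is why the method fails when that information is missing or corrupted.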
This approach has limitations as well: in some cases the binary file has no .debug_frame section and cannot be recompiled; some components (e.g., libraries) have no .debug_frame section; or .debug_frame (or part of it) is corrupted.
Thread Visualizer uses the .debug_frame-based approach when possible and its own method when that approach cannot be used because of the limitations listed above.
4.3 Thread Visualizer’s Method
To unwind the stack, the stack (or a copy of it) must be analyzed frame by frame, from the stack pointer to the stack start address (the stack grows towards lower addresses), collecting the return address of every frame. This requires knowing the size of every frame and the position of the return address within it. The frame size can be determined by analyzing the code of the corresponding procedure: finding all instructions that decrease the stack pointer and summing the amounts by which they decrease it.
The binary code of a procedure usually consists of three parts: a prologue, a body and an epilogue. The prologue begins at the start address of the procedure and contains the instructions that establish the stack frame size and store the necessary register values (including the return address) in the stack frame. Thread Visualizer’s method is therefore based on analyzing the prologue code of procedures that have no debug information on stack frame layout (or whose debug information is broken or cannot be used for other reasons).
According to the proposed method, the prologue instructions that decrease the stack pointer (pushing register values onto the stack, allocating space for local variables, etc.) are located and processed; the frame size is calculated by summing the decrease values, which gives the frame start address. The instruction that stores the return address of the called procedure is then located, which gives the offset of the return address from the frame start address. The backtrace can thus be formed as an array of return addresses.
The number of prologue instructions to analyze should be chosen per CPU architecture and be no less than the maximum prologue size for that architecture. Strictly speaking, the “prologue” is a compiler-dependent concept; if the compiler does not mark a prologue, a reasonable number of leading instructions should be analyzed. Note that the address of the last analyzed instruction must be lower than the address of the last executed instruction (at the moment the stack snapshot was taken or the process was stopped) for the first procedure in the backtrace, and lower than the respective return address for the other procedures, because instructions beyond that point have not yet been executed in that frame.
The start address of the code of the first procedure can be found from the known address of the last executed instruction, which lies inside the first procedure. The start address of each subsequent procedure can be found from the return address of the previous procedure, which lies inside the next one. The last procedure is detected in one of two ways: its frame is the last one in the stack snapshot, or the return address of the called procedure lies outside the text block of the binary file. Given an address inside a procedure, the procedure’s start address can be determined from the block of the binary file that holds procedures’ start addresses, sizes, symbolic names, etc. (the .symtab section of a binary file in ELF format).
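Finding the procedure that contains a given address can be sketched as a binary search over symbol records sorted by start address. The sketch assumes the records have already been extracted from .symtab, sorted, and do not overlap; the `proc_symbol` type and `find_procedure` function are hypothetical names, and no ELF parsing is shown.

```c
#include <stddef.h>
#include <stdint.h>

/* One procedure record, as could be derived from an ELF .symtab entry. */
typedef struct {
    uint32_t start;    /* start address of the procedure code */
    uint32_t size;     /* size of the procedure in bytes */
    const char *name;  /* symbolic name */
} proc_symbol;

/* Returns the record whose range [start, start + size) contains addr,
 * or NULL if addr lies outside every known procedure.
 * Precondition: syms is sorted by start and ranges do not overlap. */
const proc_symbol *find_procedure(const proc_symbol *syms, size_t n,
                                  uint32_t addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < syms[mid].start)
            hi = mid;
        else if (addr - syms[mid].start >= syms[mid].size)
            lo = mid + 1;
        else
            return &syms[mid];
    }
    return NULL;
}
```

A NULL result corresponds to the stop condition described above, where a return address falls outside the text block of the binary file.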
Finally, consider an example of analyzing a procedure’s prologue with the proposed method. Suppose an ARM assembly prologue contains the instructions push {lr} and sub sp, sp, #12.
In this prologue, the instructions that decrease the stack pointer are push {lr}, which decreases the stack pointer by 4, and sub sp, sp, #12, which decreases it by 12. The total frame size is therefore 4 + 12 = 16 bytes. The instruction that stores the return address is push {lr}, which places the return address in the first word of the frame, where it can be found and read.
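The arithmetic in this example can be reproduced with a small C sketch that scans 32-bit ARM (not Thumb) instruction words. It recognizes only three encodings with the "always" condition code: push of a register list, the single-register push {lr}, and sub sp, sp, #imm. A practical analyzer would need many more cases. The `frame_info` type and `analyze_prologue` function are hypothetical and are not Thread Visualizer’s actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t frame_size;  /* total bytes by which the prologue lowers SP */
    uint32_t ra_offset;   /* saved LR is at SP + ra_offset after the prologue */
    int      ra_found;    /* nonzero if an instruction storing LR was seen */
} frame_info;

/* Counts the registers named in a 16-bit register list. */
static unsigned count_regs(uint32_t reglist)
{
    unsigned n = 0;
    for (reglist &= 0xffffu; reglist != 0; reglist >>= 1)
        n += reglist & 1u;
    return n;
}

frame_info analyze_prologue(const uint32_t *code, size_t ninsns)
{
    frame_info fi = {0, 0, 0};
    uint32_t ra_depth = 0;  /* distance of saved LR below the frame start */
    assert(code != NULL || ninsns == 0);
    for (size_t i = 0; i < ninsns; i++) {
        uint32_t insn = code[i];
        if ((insn & 0xffff0000u) == 0xe92d0000u) {         /* push {list} */
            uint32_t regs = insn & 0xffffu;
            if (regs & (1u << 14)) {
                /* LR is stored highest unless PC is in the list too. */
                ra_depth = fi.frame_size + 4u + ((regs >> 15) & 1u) * 4u;
                fi.ra_found = 1;
            }
            fi.frame_size += 4u * count_regs(regs);
        } else if (insn == 0xe52de004u) {                  /* push {lr} */
            fi.frame_size += 4u;
            ra_depth = fi.frame_size;
            fi.ra_found = 1;
        } else if ((insn & 0xfffff000u) == 0xe24dd000u) {  /* sub sp, sp, #imm */
            uint32_t imm8 = insn & 0xffu;
            unsigned rot = 2u * ((insn >> 8) & 0xfu);
            fi.frame_size += rot ? ((imm8 >> rot) | (imm8 << (32u - rot)))
                                 : imm8;
        }
    }
    if (fi.ra_found)
        fi.ra_offset = fi.frame_size - ra_depth;
    return fi;
}
```

For the two-instruction prologue discussed above, encoded as 0xe52de004 and 0xe24dd00c, the sketch reports a frame size of 16 bytes and a return address at SP + 12, which is the first (highest) word of the frame.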
5. Results and Discussion
Thread Visualizer helps to debug multi-threaded embedded applications and shows the developer the system-wide behavior of an application. One direction of further development is adding new features that help in debugging multi-threaded applications, such as kernel thread visualization, concurrency level checking, etc. Another direction is further research on backtracing techniques in order to extend the backtracing approach to Thumb [7] code and to other CPU architectures and compilers.
6. Conclusions
Thread Visualizer’s method of call stack unwinding can be used when the frame pointer is omitted; when the binary file has no debug information on stack frame layout; when some components of the binary file (e.g., libraries) have no debug information; or when the debug information (or part of it) is corrupted or cannot be used for other reasons and the application cannot be recompiled. In other words, the method allows stack unwinding without any debug information, which is especially significant for embedded applications.
References
[1] A.A. Gerenkov, E.A. Gorelkina, S.S. Grekhov, S.Y. Dianov, J. Jeong, O. Kokachev, L.V. Komkov, S.B. Lee, M.P. Levin, System-wide analyzer of performance: performance analysis of multi-core computing systems with limited resources, in: Proceedings of Eurocon 2009 International IEEE Conference Devoted to the 150-Anniversary of Alexander S. Popov, Saint-Petersburg, Russia, May 18-23, 2009, pp. 1302-1307.
[2] P. Panchamukhi, Kernel Debugging with Kprobes, Linux Technology Center, IBM India Software Labs, available online at: http://www.ibm.com/developerworks/library/lkprobes/index.html, Aug 19, 2004.
[3] Boost Performance Optimization and Multicore Scalability on Windows and Linux, available online at: http://software.intel.com/en-us/intel-vtune.
[4] pthread_create(3), Linux Man Page, available online at: http://linux.die.net/man/3/pthread_create.
[5] Call Stack, Frame Pointer Structure, Wikipedia, available online at: http://en.wikipedia.org/wiki/Frame_pointer#Structure.
[6] DWARF Debugging Format Standards (See Call Frame Information), available online at: http://www.dwarfstd.org/Download.php.
[7] ARM Architecture, Thumb, available online at: http://en.wikipedia.org/wiki/ARM_architecture#Thumb.