Sunday, August 9, 2015

Linux Kernel Debugging using KGDB and User space process debugging using gdb

Linux Kernel Debugging using KGDB:

The kernel can be debugged with kgdb over a serial port, inside a virtual machine, or over a network connection.

We need two systems to debug the kernel with kgdb. Let's see how to debug over a serial port connection.

Target system: the machine that runs the kernel under development.

Host system: the machine that runs gdb, which we use to debug the kernel running on the target machine.

Make sure the KGDB-related config options are enabled:
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y

Add the boot parameters "kgdbwait kgdboc=ttyS4,115200" on the target system. For more details on adding boot parameters, refer to "http://ayyappa-ch.blogspot.in/2015/07/serial-console-logging.html".

Boot the target machine with these boot parameters.

When I tried to debug this way, the target machine stopped responding at this point. The actual limitation is that KGDB does not support USB keyboards, and our target supports only a USB keyboard. So we changed the boot parameters to "console=ttyS4,115200n8 kgdbwait kgdboc=ttyS4".

This configuration makes the COM port the console. The host and the target are connected through the serial port.
On the host's serial terminal we could see the same messages and were able to enter kdb mode. The target is now waiting for the host to connect with gdb. Once the target has entered kdb mode, close the serial terminal on the host, because the same serial port is going to be used to debug the kernel.
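
As a small illustration, the two boot-parameter strings used in this post can be assembled by a helper like the following Python sketch. The function name kgdb_boot_params and its arguments are hypothetical, made up only for this example; the port and baud rate are the ones from this setup.

```python
def kgdb_boot_params(port="ttyS4", baud=115200, serial_console=True):
    """Build the kgdb boot parameters used in this post (illustrative helper)."""
    if serial_console:
        # Mirrors the second variant in the post: console= carries the line
        # settings and kgdboc= only names the port.
        params = [f"console={port},{baud}n8", "kgdbwait", f"kgdboc={port}"]
    else:
        # Mirrors the first variant: kgdboc= itself carries the baud rate.
        params = ["kgdbwait", f"kgdboc={port},{baud}"]
    return " ".join(params)


if __name__ == "__main__":
    print(kgdb_boot_params())                      # variant with the serial console
    print(kgdb_boot_params(serial_console=False))  # variant without it
```

The resulting string is what gets appended to the kernel command line on the target.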

Start the host machine and make sure the sources, the map file, vmlinux and the other object files are available on it. Go to the directory that contains vmlinux.

$ gdb ./vmlinux

(gdb) set debug serial 1

(gdb) break emmc_init
This sets a breakpoint at emmc_init.

(gdb) target remote /dev/ttyS4
This makes the connection between the host and the target machine. After this, control comes to the host machine and the target waits on the host.

(gdb) cont
This continues the boot process on the target, and control goes to the target. The host terminal will be waiting.

On the target machine enter  $ echo "g" > /proc/sysrq-trigger
This gives control back to the host machine to continue debugging.

Continue debugging as usual with gdb.
(gdb) bt - gives a backtrace of the kernel
(gdb) cont
This continues until a breakpoint is hit; if no breakpoint is hit, control goes back to the target machine.

On the target machine, enter  $ echo "g" > /proc/sysrq-trigger  again.
This again gives control to the host machine, so we can debug again.

We can use either a Linux machine or a Windows machine as the host system. On a Windows system, use the MinGW build of gdb.

Kernel module debugging:
On the target machine, get the required module information:
$ modprobe your_module
$ cd /sys/module/your_module/sections
$ cat .text
$ cat .bss
$ cat .data
$ cat .rodata

On the host machine, add the module's symbols for debugging:
(gdb) add-symbol-file module_name_with_complete_path \
text_segment_address \
-s .bss bss_segment_address \
-s .rodata rodata_segment_address \
-s .data data_segment_address

Now we can set breakpoints in our modules as well.
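
Typing the section addresses by hand is error-prone, so the steps above can be sketched as a small Python helper that reads the section files and prints the add-symbol-file command. This is only a sketch: the function name gdb_add_symbol_file is made up for this example, it assumes each file under the sections directory holds a single address, and reading the real addresses on the target usually needs root.

```python
from pathlib import Path


def gdb_add_symbol_file(module_path, sections_dir,
                        extra_sections=(".bss", ".rodata", ".data")):
    """Return a gdb add-symbol-file command built from a module's section files.

    sections_dir is expected to look like /sys/module/<name>/sections,
    with one file per section containing that section's load address.
    """
    sections = Path(sections_dir)
    text_address = (sections / ".text").read_text().strip()
    parts = [f"add-symbol-file {module_path} {text_address}"]
    for name in extra_sections:
        entry = sections / name
        if entry.exists():  # not every module has every section
            parts.append(f"-s {name} {entry.read_text().strip()}")
    return " ".join(parts)


if __name__ == "__main__":
    import sys
    # Usage (on the target): python3 thisfile.py /path/to/your_module.ko your_module
    if len(sys.argv) == 3:
        print(gdb_add_symbol_file(sys.argv[1], f"/sys/module/{sys.argv[2]}/sections"))
```

The printed line can be pasted at the (gdb) prompt on the host; the module path in it must be the path of the .ko file as seen from the host.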

User Space Process debugging:
Let's see how to debug the Xorg process.

$ ps -e | grep Xorg - get the pid of the Xorg process

$ sudo gdb -p $(pidof Xorg) - attach gdb to the Xorg process

(gdb) bt - prints the stack trace of Xorg

The link below has good examples of the gdb list and disassemble commands.

https://wiki.ubuntu.com/Kernel/KernelDebuggingTricks

gdb debugging commands:

step - step into sub-routine

next - step over a sub-routine call in one go

finish - run till current function returns

return - make selected stack frame return to its caller

jump [address] - continue program at specified line or address

list  - print 10 more lines

info break - show list and status of break points

break function - set a breakpoint at a function

break file:line - set a breakpoint at a line of a file

bt - display back trace

info args - shows args of current frame

info locals - shows locals of current frame

thread apply all bt  - Back trace of all the threads

info threads - information about threads









Friday, July 24, 2015

Serial Console logging

To debug boot-up issues we need to configure the serial port as a console in the kernel. With this we can easily collect the dmesg output.

Steps to enable serial logging:

1) Make sure the serial driver is built statically into the kernel image. If it is built as a module, we cannot get the boot-up logs, because the module is not loaded yet at that time. If it has to be built as a module, make sure the module name is added to /etc/modules.conf so that the module gets loaded.

2) Add kernel boot parameters to enable the serial port as a console.

Edit the /etc/default/grub file.

Replace  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"  with

GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS4,115200n8 debug ignore_loglevel"

Close the file and run "sudo update-grub".

Make sure the target system's UART is connected to the other system, and configure the UART on that system.

Reboot the system and observe that the logs are displayed on both the test system and the other system.

console=tty0  =>  tells the kernel to keep using the standard console device. Boot messages are displayed on the target system's monitor.

console=ttyS4 => tells the kernel to also use the serial device ttyS4 as a console. In ttySn, n is the number of an available serial port. Grep the dmesg output for "ttyS" to find the ports available on your system.

115200n8 => sets the serial speed to 115200 bits per second, with no parity and 8 data bits per character. The other end of the serial line must be configured with the same settings.

debug ignore_loglevel => lets you see ALL kernel messages on the configured console.

drm.debug=14 => add this if DRM-specific debug logs are needed in addition to the other logs.

log_buf_len=16M => sets the dmesg buffer size so that all the required logs fit.
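
To make the structure of the console= option concrete, here is a small Python sketch that splits it into device, baud rate, parity and data bits. The function name parse_console is made up for this example. It follows the serial option layout described in the kernel parameter documentation (baud rate, then parity n/o/e, then number of bits, then an optional r for RTS flow control) and is not meant as a complete parser.

```python
import re

# console=<device>[,<baud>[<parity>[<bits>[<flow>]]]]
_CONSOLE_RE = re.compile(
    r"console=(?P<device>[^,\s]+)"
    r"(?:,(?P<baud>\d+)(?P<parity>[noe])?(?P<bits>\d)?(?P<flow>r)?)?"
)


def parse_console(option):
    """Split a console= boot parameter into its fields (illustrative only)."""
    match = _CONSOLE_RE.fullmatch(option)
    if match is None:
        raise ValueError(f"not a console= option: {option!r}")
    fields = match.groupdict()
    return {
        "device": fields["device"],
        "baud": int(fields["baud"]) if fields["baud"] else None,
        "parity": fields["parity"],          # 'n', 'o', 'e' or None
        "bits": int(fields["bits"]) if fields["bits"] else None,
        "rts_flow_control": fields["flow"] == "r",
    }


if __name__ == "__main__":
    print(parse_console("console=ttyS4,115200n8"))
    print(parse_console("console=tty0"))
```

The same fields (speed, parity, data bits) are what has to be set identically in the terminal program on the other end of the serial line.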




Monday, July 13, 2015

extern Vs EXPORT_SYMBOL ?

The EXPORT_SYMBOL macro creates a new kernel symbol entry and puts it in a special section of the kernel image, the __ksymtab section. When modules are loaded dynamically, the loader resolves their symbols at run time by looking them up in the entries of this section.

To export a symbol, the function prototype does not need to be written with extern; the compiler assumes extern for function declarations. Its use is as unnecessary as using auto to declare automatic/local variables in a block.

extern alone is enough for non-static functions that are statically linked into the same image at build time.
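
One way to check whether a symbol has been exported is to look for its __ksymtab entry in a symbol listing. The Python sketch below assumes a System.map-style listing (address, type, name per line) in which every exported symbol has a companion entry named __ksymtab_<symbol>; whether such entries are visible depends on the kernel version and configuration. The function name and the sample lines are made up for illustration.

```python
def exported_symbols(symbol_lines):
    """Return the set of symbol names that have a __ksymtab_<name> entry."""
    prefix = "__ksymtab_"
    names = set()
    for line in symbol_lines:
        fields = line.split()
        # Expected layout: <address> <type> <name>
        if len(fields) >= 3 and fields[2].startswith(prefix):
            names.add(fields[2][len(prefix):])
    return names


if __name__ == "__main__":
    # Made-up sample lines, only to show the expected input format.
    sample = [
        "ffffffc000123450 T my_exported_func",
        "ffffffc000123500 T my_internal_func",
        "ffffffc000789000 r __ksymtab_my_exported_func",
    ]
    print(exported_symbols(sample))
```

A function that is only declared extern, without EXPORT_SYMBOL, would appear in the listing but would have no __ksymtab entry, so a module could not link against it.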

Monday, July 6, 2015

Why Serial Data Transmission Is Faster Than Parallel Data Transmission?

How can serial data transmission be faster than parallel data transmission? Let's see what it means.

To start with the basics: at a given frequency, parallel data transmission is faster than serial data transmission. With an 8-bit-wide parallel bus we can transfer 1 byte per clock cycle, but only one bit with serial data transmission.

But we cannot use parallel data transmission at high frequency. High frequency causes data loss, data corruption and synchronization issues between the lines. The data may then need to be retransmitted, so in the end the data rate is limited.

If we use a high frequency for serial data transmission, we can achieve a higher data rate, because the data is transmitted on only one line, so there are no synchronization issues between lines and fewer stability issues.

This is the reason the SATA (AHCI) interface is faster than PATA. Newer systems use SATA rather than PATA.

SATA can also run at lower power than PATA because it has fewer pins.

So the latest serial links can be called High Speed Serial Data Transmission.

Another example is PCIe, which is faster than PCI.

PCIe is a high-speed serial computer expansion bus.
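
The arithmetic behind this comparison can be sketched in a few lines of Python. The function name is made up for this example; the figures used are the commonly quoted ones for conventional PCI (32 bits wide at about 33.33 MHz) and for a single first-generation PCIe lane (2.5 GT/s with 8b/10b encoding, so 8 data bits for every 10 bits on the wire).

```python
def throughput_mb_per_s(transfers_per_s, data_bits_per_transfer,
                        encoding_efficiency=1.0):
    """Usable data rate in MB/s (1 MB = 10**6 bytes)."""
    bits_per_s = transfers_per_s * data_bits_per_transfer * encoding_efficiency
    return bits_per_s / 8 / 1e6


# Parallel: conventional PCI, 32 data lines clocked at about 33.33 MHz.
pci = throughput_mb_per_s(33.33e6, 32)

# Serial: one PCIe 1.x lane, 1 bit per transfer at 2.5 GT/s, 8b/10b encoded.
pcie_x1 = throughput_mb_per_s(2.5e9, 1, encoding_efficiency=8 / 10)

if __name__ == "__main__":
    print(f"PCI (parallel, shared bus): {pci:.1f} MB/s")
    print(f"PCIe 1.x x1 (serial lane):  {pcie_x1:.1f} MB/s")
```

The serial lane moves only one bit per transfer, but because it can be clocked roughly 75 times faster, it still ends up with close to twice the data rate of the 32-bit parallel bus.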


Friday, July 3, 2015

IOMMU - I/O Memory Management Unit

IOMMU stands for I/O Memory Management Unit. It connects a DMA-capable I/O bus to the system memory. It provides memory protection and address translation for I/O devices. The address translation is page-based. Some architectures also allow interrupt remapping, in a manner generally similar to address translation.

The IOMMU is controlled by the system rather than by the device. Because control is with the OS, a faulty device cannot corrupt memory it has not been given access to; this is how the IOMMU provides memory protection.

IOMMU is transparent to devices and their drivers.

IOMMU is used to enhance virtualized I/O performance.

Large regions of memory can be allocated without the need to be contiguous in physical memory. IOMMU will take care of mapping contiguous virtual addresses to fragmented physical addresses.
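
A toy model may help to picture this. The Python sketch below is not how any real IOMMU driver works; the class ToyIommu, its method names and the 4 KB page size are assumptions made for illustration. It maps a contiguous range of I/O virtual addresses onto scattered physical page frames and rejects accesses to anything that was not mapped.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyIommu:
    """Page-based translation from I/O virtual addresses to physical addresses."""

    def __init__(self):
        self.page_table = {}  # I/O virtual page number -> physical frame number

    def map_buffer(self, iova_start, physical_frames):
        """Map a contiguous I/O virtual range onto the given (scattered) frames."""
        if iova_start % PAGE_SIZE:
            raise ValueError("iova_start must be page aligned")
        first_page = iova_start // PAGE_SIZE
        for index, frame in enumerate(physical_frames):
            self.page_table[first_page + index] = frame

    def translate(self, iova):
        """Translate one device address; unmapped pages are rejected."""
        page, offset = divmod(iova, PAGE_SIZE)
        if page not in self.page_table:
            raise PermissionError(f"DMA to unmapped address {iova:#x}")
        return self.page_table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    iommu = ToyIommu()
    # Three physical frames that are not adjacent in physical memory.
    iommu.map_buffer(0x10000, [7, 3, 42])
    for address in (0x10000, 0x11004, 0x12FFF):
        print(f"{address:#x} -> {iommu.translate(address):#x}")
```

The device sees one contiguous 12 KB buffer starting at 0x10000, while the data actually lives in three unrelated frames; an access outside the mapped range fails, which is the memory protection aspect described earlier.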

For devices that do not support memory addresses long enough to address the entire physical memory, the device can still address the entire memory through the IOMMU. This avoids overhead associated with buffer copies ( bounce buffer) to and from the memory space the peripheral can address.

Virtualized guest operating systems can safely be granted direct access to hardware. (Device pass-through).

Guest Virtual Address -> Guest Physical Address -> System Physical Address.
From the host's point of view, a guest physical address corresponds to a system (host) virtual address.
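
The chain Guest Virtual Address -> Guest Physical Address -> System Physical Address is simply two page-table lookups applied one after the other. The Python sketch below shows this with two dictionaries standing in for the guest's and the host's page tables; the function names, the 4 KB page size and the sample page numbers are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def translate(page_table, address):
    """One translation stage: look up the page, keep the offset within the page."""
    page, offset = divmod(address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset  # KeyError models a fault


def guest_virtual_to_system_physical(guest_table, host_table, guest_virtual):
    """Apply the guest's translation first, then the host's."""
    guest_physical = translate(guest_table, guest_virtual)
    return translate(host_table, guest_physical)


if __name__ == "__main__":
    guest_table = {5: 2}  # guest virtual page 5 -> guest physical page 2
    host_table = {2: 9}   # guest physical page 2 -> system physical page 9
    gva = 5 * PAGE_SIZE + 0x10
    spa = guest_virtual_to_system_physical(guest_table, host_table, gva)
    print(f"{gva:#x} -> {spa:#x}")
```

The guest controls only the first table; the second stage is owned by the host, which is what keeps a guest confined to its own memory.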

Device pass-through: This is the ability to directly assign a physical device to a particular guest OS. The required address space translation is handled transparently. Ideally a device's address space is the same as the guest's physical address space; however, in the virtualized case this is hard to achieve without an IOMMU. The IOMMU remaps the addresses that the device uses for DMA, which the guest programs as guest physical addresses, to host physical addresses. This allows the guest OS to access the hardware directly, which improves performance. Without an IOMMU, guest physical addresses have to be remapped to host physical addresses with the intervention of the host OS, which delays the process. So giving the VM direct access improves performance.

Remapping of interrupts: Sharing device interrupts among several guests is usually complicated to handle. The IOMMU provides a basis for separating device interrupts that are shared by different devices. It remaps a shared interrupt to an exclusive vector to ease its delivery to a particular guest OS.

Disadvantages:
I/O page tables need some amount of physical memory.
The overhead of I/O address translation cannot be avoided.

Note: I have just tried to put my understanding of guest OS to host OS interaction with the IOMMU into my own words. Thanks to Google for all the information.

Tuesday, June 2, 2015

Basic Git commands

Basic Git commands:
--------------------------

git branch -av -> list all available branches


git branch -D branch_name -> to delete the branch locally


git checkout master -> switch the branch to master


git init - initialize an empty Git repository


git add . - add all files


git commit -m 'Initial commit.' - commit the  changes


git checkout -b test - create a new branch test


git branch test  -  creates the test branch but does not switch to it


git checkout test -  switch to the test branch


git checkout -f commitid  -> discard any local changes and check out the specified commit id.


git clean -fd  -> remove untracked files and directories


git checkout -f  -> discard uncommitted changes to tracked files


Generating patch using git:

-----------------------------------

git format-patch master --stdout  >  fix_empty_poster.patch


=> This creates a new file fix_empty_poster.patch with all the commits that are on the current branch but not on master.



git format-patch -1 commit-id    =>  generate the patch for a single commit, given its commit id



Patch commands:

-----------------------

patch -p1 < patch_file -> applying the patch


patch -p1 --dry-run < patch_file -> shows how the patch would apply, without actually applying it


patch -p1 -R < patch_file - reverting the patch



Applying the patch using git:

-----------------------------
git apply --stat file.patch =>  show stats.

git apply --check file.patch =>  check for error before applying.


git am < file.patch => apply the patch finally.



Git am VS patch command:

-------------------------------------

patch only applies the code changes, and we have to commit them ourselves afterwards. "git am" applies the code changes and also commits them, because it can extract the author information and the commit message from the patch.


Undoing Changes:

--------------------------
git revert <commit>

Generate a new commit that undoes all of the changes introduced in <commit>, then apply it to the current branch.


git reset --hard <commit>

Move the current branch tip backward to <commit> and reset both the staging area and the working directory to match. This obliterates not only the uncommitted changes, but all commits after <commit>, as well.

git clean -f

Remove untracked files from the current directory.

https://www.atlassian.com/git/tutorials/undoing-changes/git-checkout



Debugging with GIT:

---------------------------

Git bisect  - The bisect command does a binary search through your commit history to help you identify as quickly as possible which commit introduced an issue.


Git blame - you can annotate the file with git blame to see when each line of the method was last edited and by whom.


https://git-scm.com/book/en/v2/Git-Tools-Debugging-with-Git
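
The binary search that git bisect performs can be sketched in a few lines of Python. This is a simplified model and not git's actual implementation: it assumes a linear history ordered from oldest to newest, with the first commit known to be good and the last one known to be bad, and the function and variable names are made up for this example.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits: list ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    is_bad:  callable that tests one commit (for example by building and
             running a test) and returns True if the problem is present.
    """
    good, bad = 0, len(commits) - 1  # indexes of last known good / first known bad
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle
        else:
            good = middle
    return commits[bad]


if __name__ == "__main__":
    history = [f"c{i}" for i in range(10)]
    broken = {"c6", "c7", "c8", "c9"}  # pretend the issue was introduced in c6
    print(first_bad_commit(history, lambda commit: commit in broken))
```

Because the search range is halved at every step, only about log2(N) commits have to be tested for a history of N commits.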



To find the URL of the remote that a local Git repository was cloned from:           git remote show origin

Sunday, May 24, 2015

RT Linux - PREEMPT_RT_FULL Support on Arm64

Below is a patch that adds PREEMPT_RT_FULL support on Arm64-based systems, with 3.18.9-rt5 as the base kernel.

The patch contains:
1) PREEMPT_RT_FULL support
2) Support for 32-bit apps on a 64-bit kernel (discussed in my previous post)
3) 64K page support for 32-bit apps.
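
For item 3, the key step in the patch is the compat_sys_mmap2_wrapper routine in entry32.S. A 32-bit mmap2 call always passes its file offset in units of 4K, so with larger pages the kernel has to convert the offset to page units and reject offsets that are not page aligned. The Python sketch below mirrors that check as I read it from the assembly; the function name mmap2_pgoff and the default page_shift of 16 (64K pages) are assumptions made for illustration.

```python
import errno


def mmap2_pgoff(off_4k, page_shift=16):
    """Convert an mmap2 offset given in 4K units into an offset in pages.

    Returns -EINVAL when the offset is not aligned to the page size,
    in the same way the assembly wrapper does.
    """
    if page_shift < 12:
        raise ValueError("pages smaller than 4K are not meaningful here")
    # Low bits of the 4K-unit offset that must be zero for page alignment,
    # i.e. (PAGE_SIZE - 1) >> 12 ("~PAGE_MASK >> 12" in the assembly).
    low_bits = ((1 << page_shift) - 1) >> 12
    if off_4k & low_bits:
        return -errno.EINVAL
    return off_4k >> (page_shift - 12)


if __name__ == "__main__":
    print(mmap2_pgoff(16))  # 16 * 4K = 64K, exactly one 64K page
    print(mmap2_pgoff(32))  # two 64K pages
    print(mmap2_pgoff(1))   # 4K is not 64K aligned, so -EINVAL
```

With 4K pages (page_shift=12) the mask is zero and the offset passes through unchanged, which matches the patch only building the wrapper when PAGE_SHIFT is greater than 12.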


From 983e3630e3058b2deac4be17f3b8ee9dc6769f41 Mon Sep 17 00:00:00 2001
From: Ayyappa Ch <ayyappa.chandolu@amd.com>
Date: Tue, 18 Apr 2015 12:33:28 +0530
Subject: [PATCH] RT patch for AMD platform on top of 3.18.9-rt5 RT Kernel

---
 arch/arm64/Kconfig                   |   7 +-
 arch/arm64/Makefile                  |   2 +-
 arch/arm64/include/asm/cmpxchg.h     |   2 +
 arch/arm64/include/asm/thread_info.h |   3 +
 arch/arm64/include/asm/unistd.h      |   8 ++-
 arch/arm64/include/asm/unistd32.h    |   2 +-
 arch/arm64/kernel/Makefile           |   2 +-
 arch/arm64/kernel/asm-offsets.c      |   1 +
 arch/arm64/kernel/entry.S            |  18 ++++-
 arch/arm64/kernel/entry32.S          | 123 +++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/head.S             |   2 +-
 arch/arm64/kernel/process.c          |  41 ++++++++++++
 arch/arm64/kernel/sys.c              |   4 +-
 arch/arm64/kernel/sys32.S            | 115 --------------------------------
 arch/arm64/kernel/sys32.c            |  57 ++++++++++++++++
 arch/arm64/mm/fault.c                |   4 +-
 include/linux/compat.h               |  12 ++++
 include/linux/syscalls.h             |   8 ++-
 18 files changed, 280 insertions(+), 131 deletions(-)
 create mode 100644 arch/arm64/kernel/entry32.S
 delete mode 100644 arch/arm64/kernel/sys32.S
 create mode 100644 arch/arm64/kernel/sys32.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9532f8d..765b1e2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -58,9 +58,11 @@ config ARM64
  select HAVE_PERF_EVENTS
  select HAVE_PERF_REGS
  select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_LAZY
  select HAVE_RCU_TABLE_FREE
  select HAVE_SYSCALL_TRACEPOINTS
  select IRQ_DOMAIN
+ select IRQ_FORCED_THREADING
  select MODULES_USE_ELF_RELA
  select NO_BOOTMEM
  select OF
@@ -217,6 +219,7 @@ endchoice
 choice
  prompt "Virtual address space size"
  default ARM64_VA_BITS_39 if ARM64_4K_PAGES
+ default ARM64_VA_BITS_48 if ARM64_4K_PAGES
  default ARM64_VA_BITS_42 if ARM64_64K_PAGES
  help
   Allows choosing one of multiple possible virtual address
@@ -233,7 +236,7 @@ config ARM64_VA_BITS_42

 config ARM64_VA_BITS_48
  bool "48-bit"
- depends on !ARM_SMMU
+ depends on ARM64_4K_PAGES

 endchoice

@@ -298,6 +301,7 @@ config HOTPLUG_CPU
   Say Y here to experiment with turning CPUs off and on.  CPUs
   can be controlled through /sys/devices/system/cpu.

+
 source kernel/Kconfig.preempt

 config HZ
@@ -409,7 +413,6 @@ source "fs/Kconfig.binfmt"

 config COMPAT
  bool "Kernel support for 32-bit EL0"
- depends on !ARM64_64K_PAGES
  select COMPAT_BINFMT_ELF
  select HAVE_UID16
  select OLD_SIGSUSPEND3
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 20901ff..2984d9c 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -39,7 +39,7 @@ head-y := arch/arm64/kernel/head.o
 ifeq ($(CONFIG_ARM64_RANDOMIZE_TEXT_OFFSET), y)
 TEXT_OFFSET := $(shell awk 'BEGIN {srand(); printf "0x%03x000\n", int(512 * rand())}')
 else
-TEXT_OFFSET := 0x00080000
+TEXT_OFFSET := 0x0080000
 endif

 export TEXT_OFFSET GZFLAGS
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index ddb9d78..2a64433 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -79,6 +79,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
  __ret; \
 })

+#define __HAVE_ARCH_CMPXCHG 1
+
 static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
       unsigned long new, int size)
 {
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 459bf8e..8e9704f 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -50,6 +50,7 @@ struct thread_info {
  struct exec_domain *exec_domain; /* execution domain */
  struct restart_block restart_block;
  int preempt_count; /* 0 => preemptable, <0 => bug */
+ int preempt_lazy_count; /* 0 => preemptable, <0 => bug */
  int cpu; /* cpu */
 };

@@ -108,6 +109,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NEED_RESCHED 1
 #define TIF_NOTIFY_RESUME 2 /* callback before returning to user */
 #define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */
+#define TIF_NEED_RESCHED_LAZY 4
 #define TIF_NOHZ 7
 #define TIF_SYSCALL_TRACE 8
 #define TIF_SYSCALL_AUDIT 9
@@ -124,6 +126,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
 #define _TIF_NOHZ (1 << TIF_NOHZ)
 #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 6d2bf41..3bc498c 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -31,6 +31,9 @@
  * Compat syscall numbers used by the AArch64 kernel.
  */
 #define __NR_compat_restart_syscall 0
+#define __NR_compat_exit 1
+#define __NR_compat_read 3
+#define __NR_compat_write 4
 #define __NR_compat_sigreturn 119
 #define __NR_compat_rt_sigreturn 173

@@ -41,10 +44,13 @@
 #define __ARM_NR_compat_cacheflush (__ARM_NR_COMPAT_BASE+2)
 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE+5)

-#define __NR_compat_syscalls 386
+#define __NR_compat_syscalls 388
 #endif

 #define __ARCH_WANT_SYS_CLONE
+
+#ifndef __COMPAT_SYSCALL_NR
 #include <uapi/asm/unistd.h>
+#endif

 #define NR_syscalls (__NR_syscalls)
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 9dfdac4..2f163ba 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -406,7 +406,7 @@ __SYSCALL(__NR_vfork, sys_vfork)
 #define __NR_ugetrlimit 191 /* SuS compliant getrlimit */
 __SYSCALL(__NR_ugetrlimit, compat_sys_getrlimit) /* SuS compliant getrlimit */
 #define __NR_mmap2 192
-__SYSCALL(__NR_mmap2, sys_mmap_pgoff)
+__SYSCALL(__NR_mmap2, compat_sys_mmap2_wrapper)
 #define __NR_truncate64 193
 __SYSCALL(__NR_truncate64, compat_sys_truncate64_wrapper)
 #define __NR_ftruncate64 194
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 5bd029b..50f1fe1 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -18,7 +18,7 @@ arm64-obj-y := cputable.o debug-monitors.o entry.o irq.o fpsimd.o \
    cpuinfo.o

 arm64-obj-$(CONFIG_COMPAT) += sys32.o kuser32.o signal32.o \
-   sys_compat.o
+   sys_compat.o entry32.o
 arm64-obj-$(CONFIG_FUNCTION_TRACER) += ftrace.o entry-ftrace.o
 arm64-obj-$(CONFIG_MODULES) += arm64ksyms.o module.o
 arm64-obj-$(CONFIG_SMP) += smp.o smp_spin_table.o topology.o
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 9a9fce0..f774136 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -36,6 +36,7 @@ int main(void)
   BLANK();
   DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
   DEFINE(TI_PREEMPT, offsetof(struct thread_info, preempt_count));
+  DEFINE(TI_PREEMPT_LAZY, offsetof(struct thread_info, preempt_lazy_count));
   DEFINE(TI_ADDR_LIMIT, offsetof(struct thread_info, addr_limit));
   DEFINE(TI_TASK, offsetof(struct thread_info, task));
   DEFINE(TI_EXEC_DOMAIN, offsetof(struct thread_info, exec_domain));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 726b910..303d756 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -349,7 +349,14 @@ el1_irq:
  ldr w24, [tsk, #TI_PREEMPT] // get preempt count
  cbnz w24, 1f // preempt count != 0
  ldr x0, [tsk, #TI_FLAGS] // get flags
- tbz x0, #TIF_NEED_RESCHED, 1f // needs rescheduling?
+ tbz x0, #TIF_NEED_RESCHED, loop2 // needs rescheduling?
+ bl el1_preempt
+
+loop2:
+ ldr w24, [tsk, #TI_PREEMPT_LAZY] // get preempt lazy count
+ cbnz w24, 1f // preempt lazy count != 0
+
+ tbz x0, #TIF_NEED_RESCHED_LAZY, 1f // needs lazy rescheduling? (tbz takes a bit number)
  bl el1_preempt
 1:
 #endif
@@ -365,6 +372,7 @@ el1_preempt:
 1: bl preempt_schedule_irq // irq en/disable is done inside
  ldr x0, [tsk, #TI_FLAGS] // get new tasks TI_FLAGS
  tbnz x0, #TIF_NEED_RESCHED, 1b // needs rescheduling?
+ tbnz x0, #TIF_NEED_RESCHED_LAZY, 1b // needs rescheduling as per lazy? (bit number)
  ret x24
 #endif

@@ -433,7 +441,7 @@ el0_svc_compat:
  /*
  * AArch32 syscall handling
  */
- adr stbl, compat_sys_call_table // load compat syscall table pointer
+ adrp stbl, compat_sys_call_table // load compat syscall table pointer
  uxtw scno, w7 // syscall number in w7 (r7)
  mov     sc_nr, #__NR_compat_syscalls
  b el0_svc_naked
@@ -601,8 +609,12 @@ fast_work_pending:
  str x0, [sp, #S_X0] // returned x0
 work_pending:
  tbnz x1, #TIF_NEED_RESCHED, work_resched
+ ldr w2, [tsk, #TI_PREEMPT_LAZY] // get preempt lazy count (32-bit int)
+ cbnz w2, loop3 // preempt lazy count != 0
+
+ tbnz x1, #TIF_NEED_RESCHED_LAZY, work_resched // needs lazy rescheduling? (bit number)
  /* TIF_SIGPENDING, TIF_NOTIFY_RESUME or TIF_FOREIGN_FPSTATE case */
- ldr x2, [sp, #S_PSTATE]
+loop3: ldr x2, [sp, #S_PSTATE]
  mov x0, sp // 'regs'
  tst x2, #PSR_MODE_MASK // user mode regs?
  b.ne no_work_pending // returning to kernel
diff --git a/arch/arm64/kernel/entry32.S b/arch/arm64/kernel/entry32.S
new file mode 100644
index 0000000..17f3296
--- /dev/null
+++ b/arch/arm64/kernel/entry32.S
@@ -0,0 +1,123 @@
+/*
+ * Compat system call wrappers
+ *
+ * Copyright (C) 2012 ARM Ltd.
+ * Authors: Will Deacon <will.deacon@arm.com>
+ *    Catalin Marinas <catalin.marinas@arm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <linux/const.h>
+
+#include <asm/assembler.h>
+#include <asm/asm-offsets.h>
+#include <asm/errno.h>
+#include <asm/page.h>
+
+/*
+ * System call wrappers for the AArch32 compatibility layer.
+ */
+
+ENTRY(compat_sys_sigreturn_wrapper)
+ mov x0, sp
+ mov x27, #0 // prevent syscall restart handling (why)
+ b compat_sys_sigreturn
+ENDPROC(compat_sys_sigreturn_wrapper)
+
+ENTRY(compat_sys_rt_sigreturn_wrapper)
+ mov x0, sp
+ mov x27, #0 // prevent syscall restart handling (why)
+ b compat_sys_rt_sigreturn
+ENDPROC(compat_sys_rt_sigreturn_wrapper)
+
+ENTRY(compat_sys_statfs64_wrapper)
+ mov w3, #84
+ cmp w1, #88
+ csel w1, w3, w1, eq
+ b compat_sys_statfs64
+ENDPROC(compat_sys_statfs64_wrapper)
+
+ENTRY(compat_sys_fstatfs64_wrapper)
+ mov w3, #84
+ cmp w1, #88
+ csel w1, w3, w1, eq
+ b compat_sys_fstatfs64
+ENDPROC(compat_sys_fstatfs64_wrapper)
+
+/*
+ * Note: off_4k (w5) is always units of 4K.  If we can't do the requested
+ * offset, we return EINVAL.
+ */
+#if PAGE_SHIFT > 12
+ENTRY(compat_sys_mmap2_wrapper)
+ tst w5, #~PAGE_MASK >> 12
+ b.ne 1f
+ lsr w5, w5, #PAGE_SHIFT - 12
+ b sys_mmap_pgoff
+1: mov x0, #-EINVAL
+ ret lr
+ENDPROC(compat_sys_mmap2_wrapper)
+#endif
+
+/*
+ * Wrappers for AArch32 syscalls that either take 64-bit parameters
+ * in registers or that take 32-bit parameters which require sign
+ * extension.
+ */
+ENTRY(compat_sys_pread64_wrapper)
+ regs_to_64 x3, x4, x5
+ b sys_pread64
+ENDPROC(compat_sys_pread64_wrapper)
+
+ENTRY(compat_sys_pwrite64_wrapper)
+ regs_to_64 x3, x4, x5
+ b sys_pwrite64
+ENDPROC(compat_sys_pwrite64_wrapper)
+
+ENTRY(compat_sys_truncate64_wrapper)
+ regs_to_64 x1, x2, x3
+ b sys_truncate
+ENDPROC(compat_sys_truncate64_wrapper)
+
+ENTRY(compat_sys_ftruncate64_wrapper)
+ regs_to_64 x1, x2, x3
+ b sys_ftruncate
+ENDPROC(compat_sys_ftruncate64_wrapper)
+
+ENTRY(compat_sys_readahead_wrapper)
+ regs_to_64 x1, x2, x3
+ mov w2, w4
+ b sys_readahead
+ENDPROC(compat_sys_readahead_wrapper)
+
+ENTRY(compat_sys_fadvise64_64_wrapper)
+ mov w6, w1
+ regs_to_64 x1, x2, x3
+ regs_to_64 x2, x4, x5
+ mov w3, w6
+ b sys_fadvise64_64
+ENDPROC(compat_sys_fadvise64_64_wrapper)
+
+ENTRY(compat_sys_sync_file_range2_wrapper)
+ regs_to_64 x2, x2, x3
+ regs_to_64 x3, x4, x5
+ b sys_sync_file_range2
+ENDPROC(compat_sys_sync_file_range2_wrapper)
+
+ENTRY(compat_sys_fallocate_wrapper)
+ regs_to_64 x2, x2, x3
+ regs_to_64 x3, x4, x5
+ b sys_fallocate
+ENDPROC(compat_sys_fallocate_wrapper)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 0a6e4f9..34abf21 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -43,7 +43,7 @@
 #elif (PAGE_OFFSET & 0x1fffff) != 0
 #error PAGE_OFFSET must be at least 2MB aligned
 #elif TEXT_OFFSET > 0x1fffff
-#error TEXT_OFFSET must be less than 2MB
+#error TEXT_OFFSET must be less than 2MB
 #endif

  .macro pgtbl, ttb0, ttb1, virt_to_phys
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index fde9923..1a3ac52 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -368,6 +368,47 @@ unsigned long arch_align_stack(unsigned long sp)
  return sp & ~0xf;
 }

+/*
+ * CONFIG_SPLIT_PTLOCK_CPUS results in a page->ptl lock.  If the lock is not
+ * initialized by pgtable_page_ctor() then a coredump of the vector page will
+ * fail.
+ */
+
+static int __init vectors_user_mapping_init_page(void)
+{
+        pgd_t *pgd;
+        pud_t *pud;
+        pmd_t *pmd;
+        struct page *page;
+ unsigned long addr = UL(0xffffffffffffffff);
+
+        if ((((long)addr) >> VA_BITS) != -1UL)
+                return 0;
+
+        pgd = pgd_offset_k(addr);
+        if (pgd_none(*pgd))
+                return 0;
+
+        pud = pud_offset(pgd, addr);
+        if (pud_none(*pud))
+                return 0;
+
+        if (pud_sect(*pud))
+                return pfn_valid(pud_pfn(*pud));
+
+        pmd = pmd_offset(pud, addr);
+        if (pmd_none(*pmd))
+                return 0;
+
+        page = pmd_page(*(pmd));
+
+ pgtable_page_ctor(page);
+
+        return 0;
+}
+late_initcall(vectors_user_mapping_init_page);
+
+
 static unsigned long randomize_base(unsigned long base)
 {
  unsigned long range_end = base + (STACK_RND_MASK << PAGE_SHIFT) + 1;
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index 3fa98ff..b2a6508 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -39,9 +39,9 @@ asmlinkage long sys_mmap(unsigned long addr, unsigned long len,
 /*
  * Wrappers to pass the pt_regs argument.
  */
+asmlinkage long sys_rt_sigreturn_wrapper(void);
 #define sys_rt_sigreturn sys_rt_sigreturn_wrapper

-#include <asm/syscalls.h>

 #undef __SYSCALL
 #define __SYSCALL(nr, sym) [nr] = sym,
@@ -50,7 +50,7 @@ asmlinkage long sys_mmap(unsigned long addr, unsigned long len,
  * The sys_call_table array must be 4K aligned to be accessible from
  * kernel/entry.S.
  */
-void *sys_call_table[__NR_syscalls] __aligned(4096) = {
+void * const sys_call_table[__NR_syscalls] __aligned(4096) = {
  [0 ... __NR_syscalls - 1] = sys_ni_syscall,
 #include <asm/unistd.h>
 };
diff --git a/arch/arm64/kernel/sys32.S b/arch/arm64/kernel/sys32.S
deleted file mode 100644
index 423a5b3..0000000
--- a/arch/arm64/kernel/sys32.S
+++ /dev/null
@@ -1,115 +0,0 @@
-/*
- * Compat system call wrappers
- *
- * Copyright (C) 2012 ARM Ltd.
- * Authors: Will Deacon <will.deacon@arm.com>
- *    Catalin Marinas <catalin.marinas@arm.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program.  If not, see <http://www.gnu.org/licenses/>.
- */
-
-#include <linux/linkage.h>
-
-#include <asm/assembler.h>
-#include <asm/asm-offsets.h>
-
-/*
- * System call wrappers for the AArch32 compatibility layer.
- */
-
-compat_sys_sigreturn_wrapper:
- mov x0, sp
- mov x27, #0 // prevent syscall restart handling (why)
- b compat_sys_sigreturn
-ENDPROC(compat_sys_sigreturn_wrapper)
-
-compat_sys_rt_sigreturn_wrapper:
- mov x0, sp
- mov x27, #0 // prevent syscall restart handling (why)
- b compat_sys_rt_sigreturn
-ENDPROC(compat_sys_rt_sigreturn_wrapper)
-
-compat_sys_statfs64_wrapper:
- mov w3, #84
- cmp w1, #88
- csel w1, w3, w1, eq
- b compat_sys_statfs64
-ENDPROC(compat_sys_statfs64_wrapper)
-
-compat_sys_fstatfs64_wrapper:
- mov w3, #84
- cmp w1, #88
- csel w1, w3, w1, eq
- b compat_sys_fstatfs64
-ENDPROC(compat_sys_fstatfs64_wrapper)
-
-/*
- * Wrappers for AArch32 syscalls that either take 64-bit parameters
- * in registers or that take 32-bit parameters which require sign
- * extension.
- */
-compat_sys_pread64_wrapper:
- regs_to_64 x3, x4, x5
- b sys_pread64
-ENDPROC(compat_sys_pread64_wrapper)
-
-compat_sys_pwrite64_wrapper:
- regs_to_64 x3, x4, x5
- b sys_pwrite64
-ENDPROC(compat_sys_pwrite64_wrapper)
-
-compat_sys_truncate64_wrapper:
- regs_to_64 x1, x2, x3
- b sys_truncate
-ENDPROC(compat_sys_truncate64_wrapper)
-
-compat_sys_ftruncate64_wrapper:
- regs_to_64 x1, x2, x3
- b sys_ftruncate
-ENDPROC(compat_sys_ftruncate64_wrapper)
-
-compat_sys_readahead_wrapper:
- regs_to_64 x1, x2, x3
- mov w2, w4
- b sys_readahead
-ENDPROC(compat_sys_readahead_wrapper)
-
-compat_sys_fadvise64_64_wrapper:
- mov w6, w1
- regs_to_64 x1, x2, x3
- regs_to_64 x2, x4, x5
- mov w3, w6
- b sys_fadvise64_64
-ENDPROC(compat_sys_fadvise64_64_wrapper)
-
-compat_sys_sync_file_range2_wrapper:
- regs_to_64 x2, x2, x3
- regs_to_64 x3, x4, x5
- b sys_sync_file_range2
-ENDPROC(compat_sys_sync_file_range2_wrapper)
-
-compat_sys_fallocate_wrapper:
- regs_to_64 x2, x2, x3
- regs_to_64 x3, x4, x5
- b sys_fallocate
-ENDPROC(compat_sys_fallocate_wrapper)
-
-#undef __SYSCALL
-#define __SYSCALL(x, y) .quad y // x
-
-/*
- * The system calls table must be 4KB aligned.
- */
- .align 12
-ENTRY(compat_sys_call_table)
-#include <asm/unistd32.h>
diff --git a/arch/arm64/kernel/sys32.c b/arch/arm64/kernel/sys32.c
new file mode 100644
index 0000000..7800bb1
--- /dev/null
+++ b/arch/arm64/kernel/sys32.c
@@ -0,0 +1,57 @@
+/*
+ * arch/arm64/kernel/sys32.c
+ *
+ * Copyright (C) 2015 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Needed to avoid conflicting __NR_* macros between uapi/asm/unistd.h and
+ * asm/unistd32.h.
+ */
+#define __COMPAT_SYSCALL_NR
+
+#include <linux/compiler.h>
+#include <linux/syscalls.h>
+#include <asm/page.h>
+
+asmlinkage long compat_sys_sigreturn_wrapper(void);
+asmlinkage long compat_sys_rt_sigreturn_wrapper(void);
+asmlinkage long compat_sys_statfs64_wrapper(void);
+asmlinkage long compat_sys_fstatfs64_wrapper(void);
+asmlinkage long compat_sys_pread64_wrapper(void);
+asmlinkage long compat_sys_pwrite64_wrapper(void);
+asmlinkage long compat_sys_truncate64_wrapper(void);
+asmlinkage long compat_sys_ftruncate64_wrapper(void);
+asmlinkage long compat_sys_readahead_wrapper(void);
+asmlinkage long compat_sys_fadvise64_64_wrapper(void);
+asmlinkage long compat_sys_sync_file_range2_wrapper(void);
+asmlinkage long compat_sys_fallocate_wrapper(void);
+#if PAGE_SHIFT > 12
+asmlinkage long compat_sys_mmap2_wrapper(void);
+#else
+#define compat_sys_mmap2_wrapper sys_mmap_pgoff
+#endif
+
+#undef __SYSCALL
+#define __SYSCALL(nr, sym) [nr] = sym,
+
+/*
+ * The sys_call_table array must be 4K aligned to be accessible from
+ * kernel/entry.S.
+ */
+void * const compat_sys_call_table[__NR_compat_syscalls] __aligned(4096) = {
+ [0 ... __NR_compat_syscalls - 1] = sys_ni_syscall,
+#include <asm/unistd32.h>
+};
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 41cb6d3..e82c79e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -211,7 +211,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
  * If we're in an interrupt or have no user context, we must not take
  * the fault.
  */
- if (in_atomic() || !mm)
+ if (!mm || pagefault_disabled())
  goto no_context;

  if (user_mode(regs))
@@ -358,6 +358,8 @@ static int __kprobes do_translation_fault(unsigned long addr,
  if (addr < TASK_SIZE)
  return do_page_fault(addr, esr, regs);

+ if (interrupts_enabled(regs))
+ local_irq_enable();
  do_bad_area(addr, esr, regs);
  return 0;
 }
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e649426..ab25814 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);

 asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
      const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+     const compat_uptr_t __user *argv,
+     const compat_uptr_t __user *envp, int flags);

 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
  compat_ulong_t __user *outp, compat_ulong_t __user *exp,
@@ -686,6 +689,15 @@ asmlinkage long compat_sys_sendfile64(int out_fd, int in_fd,
 asmlinkage long compat_sys_sigaltstack(const compat_stack_t __user *uss_ptr,
        compat_stack_t __user *uoss_ptr);

+#ifdef __ARCH_WANT_SYS_SIGPENDING
+asmlinkage long compat_sys_sigpending(compat_old_sigset_t __user *set);
+#endif
+
+#ifdef __ARCH_WANT_SYS_SIGPROCMASK
+asmlinkage long compat_sys_sigprocmask(int how, compat_old_sigset_t __user *nset,
+       compat_old_sigset_t __user *oset);
+#endif
+
 int compat_restore_altstack(const compat_stack_t __user *uss);
 int __compat_save_altstack(compat_stack_t __user *, unsigned long);
 #define compat_save_altstack_ex(uss, sp) do { \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81..96b2ded 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -410,12 +410,16 @@ asmlinkage long sys_newlstat(const char __user *filename,
  struct stat __user *statbuf);
 asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
 asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
-#if BITS_PER_LONG == 32
+#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
 asmlinkage long sys_stat64(const char __user *filename,
  struct stat64 __user *statbuf);
 asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
 asmlinkage long sys_lstat64(const char __user *filename,
  struct stat64 __user *statbuf);
+asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
+       struct stat64 __user *statbuf, int flag);
+#endif
+#if BITS_PER_LONG == 32
 asmlinkage long sys_truncate64(const char __user *path, loff_t length);
 asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
 #endif
@@ -771,8 +775,6 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
    umode_t mode);
 asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
        struct stat __user *statbuf, int flag);
-asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
-       struct stat64 __user *statbuf, int flag);
 asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *buf,
        int bufsiz);
 asmlinkage long sys_utimensat(int dfd, const char __user *filename,
--
1.9.1
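
The sys32.c part of the patch above builds the compat syscall table as a C array: every slot is first set to sys_ni_syscall with a range designator, and the included header then expands one __SYSCALL(nr, sym) entry per call to override individual slots. Below is a minimal user space sketch of that technique only. All demo_* names, the table size and the return values are made up for illustration, and the `[a ... b]` range designator is a GNU C extension (accepted by gcc and clang).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_DEMO_CALLS 8                 /* hypothetical table size */

typedef long (*demo_call_fn)(void);

/* Default handler, playing the role of sys_ni_syscall. */
static long demo_ni_call(void) { return -ENOSYS; }

/* Two hypothetical "system calls". */
static long demo_getpid(void)  { return 1234; }
static long demo_time(void)    { return 42; }

/* Stand-in for the list that <asm/unistd32.h> provides in the kernel. */
#define DEMO_CALL_LIST \
	DEMO_CALL(2, demo_getpid) \
	DEMO_CALL(5, demo_time)

/* Each list entry becomes a designated initializer: [nr] = sym, */
#define DEMO_CALL(nr, sym) [nr] = sym,
static const demo_call_fn demo_call_table[NR_DEMO_CALLS] = {
	[0 ... NR_DEMO_CALLS - 1] = demo_ni_call,  /* fill every slot first */
	DEMO_CALL_LIST                             /* then override some */
};
#undef DEMO_CALL

/* Look up and run a call; out-of-range numbers get the default handler. */
long demo_dispatch(unsigned int nr)
{
	if (nr >= NR_DEMO_CALLS)
		return demo_ni_call();
	assert(demo_call_table[nr] != NULL);
	return demo_call_table[nr]();
}
```

Because later initializers override earlier ones for the same index, the list only has to name the implemented calls. Every other slot keeps the default handler.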


A few days later, I saw a patch for the same issue being discussed on the Linux kernel mailing list.

The only difference I could see is that I used assembly code that compares against a non-zero value, while the mailing list version compares against zero.


Cyclic test latency diagram with this change:



arch/arm64: Cyclictest fix in ARM64 fpsimd

Please follow the discussion of this patch on the kernel mailing list for the latest status.

We were able to achieve very good latency with the fix below. A lot of discussion is going on in the Linux kernel mailing list about how to make it better.

From e6a5fce9b3b55f48656240036a9354a0997c2907 Mon Sep 17 00:00:00 2001
From: Ayyappa Ch <ayyappa.chandolu@amd.com>
Date: Tue, 18 Apr 2015 11:53:00 +0530
Subject: [PATCH ] floating point realtime fix

---
 arch/arm64/kernel/fpsimd.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 2438497..3dca156 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -166,10 +166,10 @@ void fpsimd_flush_thread(void)
  */
 void fpsimd_preserve_current_state(void)
 {
- preempt_disable();
+ migrate_disable();
  if (!test_thread_flag(TIF_FOREIGN_FPSTATE))
  fpsimd_save_state(&current->thread.fpsimd_state);
- preempt_enable();
+ migrate_enable();
 }

 /*
@@ -179,7 +179,7 @@ void fpsimd_preserve_current_state(void)
  */
 void fpsimd_restore_current_state(void)
 {
- preempt_disable();
+ migrate_disable();
  if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
  struct fpsimd_state *st = &current->thread.fpsimd_state;

@@ -187,7 +187,7 @@ void fpsimd_restore_current_state(void)
  this_cpu_write(fpsimd_last_state, st);
  st->cpu = smp_processor_id();
  }
- preempt_enable();
+ migrate_enable();
 }

 /*
@@ -197,7 +197,7 @@ void fpsimd_restore_current_state(void)
  */
 void fpsimd_update_current_state(struct fpsimd_state *state)
 {
- preempt_disable();
+ migrate_disable();
  fpsimd_load_state(state);
  if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
  struct fpsimd_state *st = &current->thread.fpsimd_state;
@@ -205,7 +205,7 @@ void fpsimd_update_current_state(struct fpsimd_state *state)
  this_cpu_write(fpsimd_last_state, st);
  st->cpu = smp_processor_id();
  }
- preempt_enable();
+ migrate_enable();
 }

 /*
@@ -239,7 +239,7 @@ void kernel_neon_begin_partial(u32 num_regs)
  * that there is no longer userland FPSIMD state in the
  * registers.
  */
- preempt_disable();
+ migrate_disable();
  if (current->mm &&
     !test_and_set_thread_flag(TIF_FOREIGN_FPSTATE))
  fpsimd_save_state(&current->thread.fpsimd_state);
@@ -255,7 +255,7 @@ void kernel_neon_end(void)
  in_irq() ? &hardirq_fpsimdstate : &softirq_fpsimdstate);
  fpsimd_load_partial_state(s);
  } else {
- preempt_enable();
+ migrate_enable();
  }
 }
 EXPORT_SYMBOL(kernel_neon_end);
--
1.9.1

It seems the kernel does not save the state of the SIMD registers when a task is preempted, so if another task runs on the same CPU and also uses SIMD, it clobbers the registers of the first task, and migrate_disable() does not prevent that. So with the above patch there is a possibility that the SIMD register state gets corrupted.
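
The clobbering described above can be illustrated with a toy user space model. This is not kernel code: struct toy_cpu, struct toy_task and the helper functions are hypothetical names, and the "registers" are just an int array shared by whatever task currently runs on the CPU.

```c
#include <assert.h>
#include <string.h>

#define TOY_NREGS 4

struct toy_cpu  { int simd[TOY_NREGS]; };   /* one register file per CPU */
struct toy_task { int saved[TOY_NREGS]; };  /* per-task save area */

/* A task loads its working values into the CPU's SIMD registers. */
static void toy_use_simd(struct toy_cpu *cpu, int seed)
{
	for (int i = 0; i < TOY_NREGS; i++)
		cpu->simd[i] = seed + i;
}

/* Context switch that saves the outgoing state and restores the incoming. */
static void toy_switch(struct toy_cpu *cpu, struct toy_task *prev,
		       struct toy_task *next)
{
	assert(cpu && prev && next);
	memcpy(prev->saved, cpu->simd, sizeof(cpu->simd));
	memcpy(cpu->simd, next->saved, sizeof(cpu->simd));
}

/*
 * Task A is preempted by task B on the same CPU, then resumes.
 * Returns 1 if A still sees its own register contents afterwards.
 */
int toy_a_survives(int save_on_switch)
{
	struct toy_cpu cpu = { {0} };
	struct toy_task a = { {0} }, b = { {0} };

	toy_use_simd(&cpu, 100);            /* A is in the middle of SIMD work */
	if (save_on_switch)
		toy_switch(&cpu, &a, &b);   /* preemption saves A's state */
	toy_use_simd(&cpu, 200);            /* B uses SIMD on the same CPU */
	if (save_on_switch)
		toy_switch(&cpu, &b, &a);   /* A's state is restored */

	return cpu.simd[0] == 100;
}
```

In this model, toy_a_survives(0) corresponds to the situation described above: the task stays on its CPU (which is all that disabling migration guarantees), but nothing saves its registers before the other task overwrites them.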

Keep watching the kernel mailing list for the final fix for the fpsimd latency issue.

Wednesday, April 22, 2015

Linux Open Source Graphics Stack and Multimedia Stack

This post covers the Linux graphics stack components needed to use GPU functionality.
















X.Org Server:
------------------
X.Org Server is the free and open source implementation of the X Window System stewarded by the X.Org Foundation. It includes not only the display server but also the client libraries (such as Xlib and XCB), developer and user tools, and the rest of the components required to run an entire X Window System architecture.

The X.Org Server is divided into 2 components:
Device Independent X (DIX) : the part of the 2D graphics device driver that is not specific to any hardware.

Device Dependent X (DDX) : the part of the 2D graphics device driver that is hardware specific.

AMD's proprietary Catalyst includes such an extra device driver just for the X.Org Server, in addition to the actual kernel blobs and the user space device driver.

















Glamor :
-----------
Glamor is a generic 2D acceleration driver for the X server that works by translating the X render primitives to OpenGL operations, taking advantage of any existing 3D OpenGL drivers, proprietary and open-source.
The ultimate goal of Glamor is to obsolete and replace all the DDX (device dependent X) drivers, and the acceleration architectures for them (such as XAA, EXA, UXA or SNA), with a single hardware independent 2D driver, avoiding the need to write X 2D specific drivers for every supported graphics chipset. Glamor requires a 3D driver with shader support.

Glamor is a GL-based rendering acceleration library for X server:
------------------------------------------------------------------------------
- OpenGL based 2D rendering acceleration library. It transforms X rendering into OpenGL and EGL.
- It uses GL functions and shaders to complete the 2D graphics operations.
- It uses a normal texture to represent a drawable pixmap if possible.
- It calls GL functions to render to the texture directly.
- It is largely hardware independent and could be a building block of any X server’s DDX driver.
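
To give a feel for what "translating X rendering into OpenGL" involves, here is a small sketch of one step: mapping an X-style rectangle (pixel coordinates, origin at the top-left, y growing downwards) to the corners of a quad in OpenGL normalized device coordinates (range -1..1, y growing upwards). This is not Glamor's code. struct ndc_quad and rect_to_ndc() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Corners of a quad in OpenGL normalized device coordinates. */
struct ndc_quad {
	float left, top, right, bottom;
};

/*
 * Convert a rectangle given in drawable pixel coordinates into NDC.
 * (x, y) is the top-left corner, w and h the size, and
 * draw_w x draw_h the size of the drawable it belongs to.
 */
struct ndc_quad rect_to_ndc(int x, int y, int w, int h,
			    int draw_w, int draw_h)
{
	struct ndc_quad q;

	assert(draw_w > 0 && draw_h > 0);
	q.left   = 2.0f * x / draw_w - 1.0f;
	q.right  = 2.0f * (x + w) / draw_w - 1.0f;
	q.top    = 1.0f - 2.0f * y / draw_h;        /* flip the y axis */
	q.bottom = 1.0f - 2.0f * (y + h) / draw_h;
	return q;
}
```

A rectangle fill could then be drawn as two triangles spanning these four corners, with the fill color supplied by a fragment shader.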


A brief overview of OpenMAX (Open Media Acceleration):
-------------------------------------------------------------------------------------------------

OpenMAX provides three layers of interfaces:
Application layer (AL),
Integration layer (IL) and
Development layer (DL).

OpenMAX AL:
-----------------
1)It is the interface between multimedia applications, such as a media player, and the platform media framework.

2)It allows companies that develop applications to easily migrate their applications to different platforms (customers) that support the OpenMAX AL application programming interface (API).

OpenMAX IL:
------------------
1)It is the interface between a media framework (such as StageFright or the MediaCodec API on Android, DirectShow on Windows, FFmpeg or Libav on Linux, or GStreamer for cross-platform use) and a set of multimedia components (such as audio or video codecs).

2)It allows companies that build platforms (e.g. allowing an implementation of an MP3 player) to easily change components like MP3 decoders and equalizer effects, and to buy components for their platform from different vendors.

In the OpenMAX IL, components represent individual blocks of functionality. Components can be sources, sinks, codecs, filters, splitters, mixers, or any other data operator. Depending on the implementation, a component could possibly represent a piece of hardware, a software codec, another processor, or a combination thereof.

The OpenMAX IL API allows the user to load, control, connect, and unload the individual components. This flexible core architecture allows the Integration Layer to easily implement almost any media use case and mesh with existing graph-based media frameworks. The key focus of the OpenMAX IL API is portability of media components.
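
The load / connect / control / unload life cycle described above can be sketched with a toy component chain. This is NOT the OpenMAX IL API. struct toy_component and every toy_* function are hypothetical stand-ins, and a "buffer" is reduced to a single int so the data flow is easy to follow.

```c
#include <assert.h>
#include <stddef.h>

/* A toy component: one processing function plus a downstream link. */
struct toy_component {
	const char *name;
	int (*process)(int sample);
	struct toy_component *next;     /* set by toy_connect() */
	int loaded;
};

static int toy_decode(int s)   { return s * 2; }  /* stand-in for a codec */
static int toy_equalize(int s) { return s + 1; }  /* stand-in for a filter */

void toy_load(struct toy_component *c, const char *name, int (*fn)(int))
{
	c->name = name;
	c->process = fn;
	c->next = NULL;
	c->loaded = 1;
}

/* Connect the output of 'up' to the input of 'down'. */
void toy_connect(struct toy_component *up, struct toy_component *down)
{
	assert(up && down);
	up->next = down;
}

void toy_unload(struct toy_component *c)
{
	c->loaded = 0;
	c->next = NULL;
}

/* Push one sample through the chain; -1 if a component is not loaded. */
int toy_push(struct toy_component *head, int sample)
{
	for (struct toy_component *c = head; c != NULL; c = c->next) {
		if (!c->loaded)
			return -1;
		sample = c->process(sample);
	}
	return sample;
}

/* Load two components, connect them, run one sample, unload both. */
int toy_demo(int sample)
{
	struct toy_component dec, eq;
	int out;

	toy_load(&dec, "decoder", toy_decode);
	toy_load(&eq, "equalizer", toy_equalize);
	toy_connect(&dec, &eq);
	out = toy_push(&dec, sample);
	toy_unload(&eq);
	toy_unload(&dec);
	return out;
}
```

Swapping toy_equalize for another function with the same signature changes the chain without touching the decoder, which is the portability idea behind replaceable components.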

Applications using the GStreamer API would take advantage of hardware acceleration on platforms that provide it, when OMX IL support is integrated.

Bellagio is an open source OpenMAX IL implementation for Linux maintained by STMicroelectronics.


GStreamer (GST) is an open source multimedia framework used by several applications, and it can use OpenMAX IL modules through its "gst-omx" module.







OpenMAX DL:
------------------
1)It is the interface between physical hardware, such as digital signal processor (DSP) chips and CPUs, and software, like video codecs and 3D engines.

2)It allows companies to easily integrate new hardware that supports OpenMAX DL without reoptimizing their low level software.


MESA (COMPUTER GRAPHICS) :
-----------------------------------------------
Mesa is a collection of free and open-source libraries that implement several rendering as well as video acceleration APIs related to hardware-accelerated 3D rendering, 3D computer graphics and GPGPU, the most prominent being OpenGL.



Hardware acceleration makes it possible to use specific devices (usually a graphics card or other dedicated hardware) to perform multimedia processing. The dedicated hardware performs the demanding computation, freeing the CPU from it. The main hardware acceleration APIs are:

- VDPAU (Video Decode and Presentation API for Unix)
- Video Acceleration API (VA API)
- DirectX Video Acceleration API
- Video Decoding API

VDPAU(Video Decode and Presentation API):
---------------------------------------------------------
Video Decode and Presentation API for Unix is an open source library and API to offload portions of the video decoding process and video post-processing to the GPU video-hardware.

1)Hardware decoding of MPEG-1, MPEG-2, MPEG-4 part 2, H.264, VC-1, and DivX 4 and 5 bitstreams on supported hardware, with a bitstream (VLD) level API.

2)Video post-processing including advanced deinterlacing algorithms, inverse telecine, noise reduction, configurable color space conversion, and procamp adjustments.

3)Sub-picture, on-screen display, and UI element compositing.

4) Direct rendering timestamp-based presentation of final video frames, with detailed frame delivery reports.

UVD is AMD's dedicated video decoding ASIC.
Video Codec Engine (VCE) is the name given to AMD's video encoding ASIC.

The free radeon driver supports Unified Video Decoder (UVD) and Video Codec Engine (VCE) through VDPAU and OpenMAX.

VDPAU (Video Decode and Presentation API for Unix), a competing API designed by NVIDIA, can potentially also be used as a backend for the VA API. If this is supported, any software that supports VA API then also indirectly supports a subset of VDPAU.

Video Acceleration API (VA API):
--------------------------------------------
The VA API specification was originally designed by Intel.

The main motivation for VA API is to enable hardware-accelerated video decode at various entry-points (VLD, IDCT, motion compensation, deblocking) for the prevailing coding standards today (MPEG-2, MPEG-4 ASP/H.263, MPEG-4 AVC/H.264, and VC-1/WMV3). 

use case : vlc media player -> vaapi (Video Acceleration API) -> vdpau driver -> GPU (Unified Video Decoder provided by AMD GPU)







Free and open-source graphics device driver:
------------------------------------------------------
- Most free and open source graphics device drivers are developed via the Mesa project.

- Linux kernel component DRM

- Linux kernel component KMS driver: basically the device driver for the display controller

- user-space component libDRM: a wrapper library for the system calls of the DRM, should only be used by Mesa 3D

- user-space component in Mesa 3D: this component is highly hardware specific, is being executed on the CPU and does the translation of e.g. OpenGL commands  into machine code for the GPU; because of the split nature of the device driver, marshalling is possible;

- Mesa 3D is the only available free and open-source implementation of OpenGL, OpenGL ES, OpenVG, GLX, EGL and OpenCL. As of July 2014, most of these components are written conforming to the Gallium3D specifications.


DRI (Direct Rendering Infrastructure):
-------------------------------------------------
-The Direct Rendering Infrastructure (DRI) is a framework for allowing direct access to graphics hardware under the X Window System in a safe, efficient way.

-The main use of DRI is to provide hardware acceleration for the Mesa implementation of OpenGL.

-DRI implementation is scattered through the X Server and its associated client libraries, Mesa 3D and the Direct Rendering Manager kernel subsystem.






 DRI is split into three parts:
 -----------------------------------
1)the Direct Rendering Manager (DRM), a kernel component, for command checking and queuing (not scheduling).

2)the Mesa 3D device drivers, a userspace component, that does the translation of OpenGL commands into hardware specific commands; it prepares buffers of commands to be sent to the hardware by the DRM and interacts with the windowing system for synchronization of access to the hardware

3)The hardware specific library libdrm implements the userspace interface to the kernel DRM. Libdrm contains a full set of functions to obtain information about encoders, connectors

use case : Application -> xlib -> xserver X.org  -> mesa DRI driver -> libdrm -> drm -> GPU 




















Gallium3D:
--------------
A classic DRI driver is an OpenGL->hardware translator. The Gallium3D driver interface instead assumes the presence of programmable vertex/fragment shaders and flexible memory objects. Its goals are:

1) Make drivers smaller and simpler.

Current DRI drivers are rather complicated. They're large, contain duplicated code and are burdened with implementing many concepts tightly tied to the OpenGL 1.x/2.x API.

2) Model modern graphics hardware.

The new driver architecture is an abstraction of modern graphics hardware, rather than an OpenGL->hardware translator. The new driver interface will assume the presence of programmable vertex/fragment shaders and flexible memory objects.

3) Support multiple graphics APIs.

The reduced OpenGL 3.1+ APIs will be much smaller than OpenGL 1.x/2.x. We'd like a driver model that is API-neutral so that it's not tied to a  specific graphics API.

4) Support multiple operating systems.

Gallium drivers have no OS-specific code (OS-specific code goes into the "winsys/screen" modules) so they're portable to Linux, Windows and other  operating systems.









Direct Rendering Manager:
--------------------------------
The Direct Rendering Manager (DRM) is a subsystem of the Linux kernel responsible for interfacing with GPUs of modern video cards.

DRM exposes an API that user space programs can use to send commands and data to the GPU, and perform operations such as configuring the mode setting of the display.

Each GPU detected by DRM is referred to as a DRM device, and a device file /dev/dri/cardX (where X is a sequential number) is created to interface with it.

A library called libdrm was created to facilitate the interface of user space programs with the DRM subsystem. This library is merely a wrapper that provides a function written in C for every ioctl of the DRM API.






FFMPEG/libav:
-------------------
-libav is a library that contains all kinds of codecs, support for various container formats, some filters

-It's a library providing some API to use these things separately.

-Container parser, Software Implementation of Audio/Video decoders, SW Scaling

-Plain GStreamer can't do anything without plugins. GStreamer requires various plugins such as source (file, http, ftp, ...), demux (mp4, 3gp, flv, TS, ...) and decoder (H.264, H.263, MP3, AAC). GStreamer is a broader framework and can also use FFmpeg through plugins.


Does FFMPEG use hardware acceleration?
FFmpeg provides a subsystem for hardware acceleration using VDPAU, VAAPI, DirectX or another required API.