2005-04-17 02:20:36 +04:00
/* Copyright 2002 Andi Kleen */
2006-10-04 11:38:54 +04:00
2006-09-26 12:52:32 +04:00
# include < l i n u x / l i n k a g e . h >
2009-03-12 14:20:17 +03:00
2006-09-26 12:52:32 +04:00
# include < a s m / c p u f e a t u r e . h >
2009-03-12 14:20:17 +03:00
# include < a s m / d w a r f2 . h >
2011-05-18 02:29:16 +04:00
# include < a s m / a l t e r n a t i v e - a s m . h >
2006-09-26 12:52:32 +04:00
2005-04-17 02:20:36 +04:00
/ *
* memcpy - C o p y a m e m o r y b l o c k .
*
2009-03-12 14:20:17 +03:00
* Input :
* rdi d e s t i n a t i o n
* rsi s o u r c e
* rdx c o u n t
*
2005-04-17 02:20:36 +04:00
* Output :
* rax o r i g i n a l d e s t i n a t i o n
2009-03-12 14:20:17 +03:00
* /
2005-04-17 02:20:36 +04:00
2009-03-12 14:20:17 +03:00
/ *
* memcpy_ c ( ) - f a s t s t r i n g o p s ( R E P M O V S Q ) b a s e d v a r i a n t .
*
2009-12-18 19:16:03 +03:00
* This g e t s p a t c h e d o v e r t h e u n r o l l e d v a r i a n t ( b e l o w ) v i a t h e
2009-03-12 14:20:17 +03:00
* alternative i n s t r u c t i o n s f r a m e w o r k :
* /
2009-12-18 19:16:03 +03:00
.section .altinstr_replacement , " ax" , @progbits
.Lmemcpy_c :
2009-03-12 14:20:17 +03:00
movq % r d i , % r a x
movl % e d x , % e c x
shrl $ 3 , % e c x
andl $ 7 , % e d x
2006-09-26 12:52:32 +04:00
rep m o v s q
2009-03-12 14:20:17 +03:00
movl % e d x , % e c x
2006-09-26 12:52:32 +04:00
rep m o v s b
ret
2009-12-18 19:16:03 +03:00
.Lmemcpy_e :
.previous
2006-09-26 12:52:32 +04:00
2011-05-18 02:29:16 +04:00
/ *
* memcpy_ c _ e ( ) - e n h a n c e d f a s t s t r i n g m e m c p y . T h i s i s f a s t e r a n d s i m p l e r t h a n
* memcpy_ c . U s e m e m c p y _ c _ e w h e n p o s s i b l e .
*
* This g e t s p a t c h e d o v e r t h e u n r o l l e d v a r i a n t ( b e l o w ) v i a t h e
* alternative i n s t r u c t i o n s f r a m e w o r k :
* /
.section .altinstr_replacement , " ax" , @progbits
.Lmemcpy_c_e :
movq % r d i , % r a x
movl % e d x , % e c x
rep m o v s b
ret
.Lmemcpy_e_e :
.previous
2006-09-26 12:52:32 +04:00
ENTRY( _ _ m e m c p y )
ENTRY( m e m c p y )
CFI_ S T A R T P R O C
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
movq % r d i , % r a x
2006-02-03 23:51:02 +03:00
2009-03-12 14:20:17 +03:00
/ *
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
* Use 3 2 b i t C M P h e r e t o a v o i d l o n g N O P p a d d i n g .
2009-03-12 14:20:17 +03:00
* /
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
cmp $ 0 x20 , % e d x
jb . L h a n d l e _ t a i l
2006-02-03 23:51:02 +03:00
2009-03-12 14:20:17 +03:00
/ *
2011-05-01 16:09:21 +04:00
* We c h e c k w h e t h e r m e m o r y f a l s e d e p e n d e n c e c o u l d o c c u r ,
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
* then j u m p t o c o r r e s p o n d i n g c o p y m o d e .
2009-03-12 14:20:17 +03:00
* /
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
cmp % d i l , % s i l
jl . L c o p y _ b a c k w a r d
subl $ 0 x20 , % e d x
.Lcopy_forward_loop :
subq $ 0 x20 , % r d x
2006-02-03 23:51:02 +03:00
2009-03-12 14:20:17 +03:00
/ *
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
* Move i n b l o c k s o f 4 x8 b y t e s :
2009-03-12 14:20:17 +03:00
* /
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
movq 0 * 8 ( % r s i ) , % r8
movq 1 * 8 ( % r s i ) , % r9
movq 2 * 8 ( % r s i ) , % r10
movq 3 * 8 ( % r s i ) , % r11
leaq 4 * 8 ( % r s i ) , % r s i
movq % r8 , 0 * 8 ( % r d i )
movq % r9 , 1 * 8 ( % r d i )
movq % r10 , 2 * 8 ( % r d i )
movq % r11 , 3 * 8 ( % r d i )
leaq 4 * 8 ( % r d i ) , % r d i
jae . L c o p y _ f o r w a r d _ l o o p
addq $ 0 x20 , % r d x
jmp . L h a n d l e _ t a i l
.Lcopy_backward :
/ *
* Calculate c o p y p o s i t i o n t o t a i l .
* /
addq % r d x , % r s i
addq % r d x , % r d i
subq $ 0 x20 , % r d x
/ *
* At m o s t 3 A L U o p e r a t i o n s i n o n e c y c l e ,
* so a p p e n d N O P S i n t h e s a m e 1 6 b y t e s t r u n k .
* /
.p2align 4
.Lcopy_backward_loop :
subq $ 0 x20 , % r d x
movq - 1 * 8 ( % r s i ) , % r8
movq - 2 * 8 ( % r s i ) , % r9
movq - 3 * 8 ( % r s i ) , % r10
movq - 4 * 8 ( % r s i ) , % r11
leaq - 4 * 8 ( % r s i ) , % r s i
movq % r8 , - 1 * 8 ( % r d i )
movq % r9 , - 2 * 8 ( % r d i )
movq % r10 , - 3 * 8 ( % r d i )
movq % r11 , - 4 * 8 ( % r d i )
leaq - 4 * 8 ( % r d i ) , % r d i
jae . L c o p y _ b a c k w a r d _ l o o p
2006-02-03 23:51:02 +03:00
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
/ *
* Calculate c o p y p o s i t i o n t o h e a d .
* /
addq $ 0 x20 , % r d x
subq % r d x , % r s i
subq % r d x , % r d i
2006-02-03 23:51:02 +03:00
.Lhandle_tail :
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
cmpq $ 1 6 , % r d x
jb . L l e s s _ 1 6 b y t e s
2009-03-12 14:20:17 +03:00
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
/ *
* Move d a t a f r o m 1 6 b y t e s t o 3 1 b y t e s .
* /
movq 0 * 8 ( % r s i ) , % r8
movq 1 * 8 ( % r s i ) , % r9
movq - 2 * 8 ( % r s i , % r d x ) , % r10
movq - 1 * 8 ( % r s i , % r d x ) , % r11
movq % r8 , 0 * 8 ( % r d i )
movq % r9 , 1 * 8 ( % r d i )
movq % r10 , - 2 * 8 ( % r d i , % r d x )
movq % r11 , - 1 * 8 ( % r d i , % r d x )
retq
2006-02-03 23:51:02 +03:00
.p2align 4
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
.Lless_16bytes :
cmpq $ 8 , % r d x
jb . L l e s s _ 8 b y t e s
/ *
* Move d a t a f r o m 8 b y t e s t o 1 5 b y t e s .
* /
movq 0 * 8 ( % r s i ) , % r8
movq - 1 * 8 ( % r s i , % r d x ) , % r9
movq % r8 , 0 * 8 ( % r d i )
movq % r9 , - 1 * 8 ( % r d i , % r d x )
retq
.p2align 4
.Lless_8bytes :
cmpq $ 4 , % r d x
jb . L l e s s _ 3 b y t e s
2009-03-12 14:20:17 +03:00
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
/ *
* Move d a t a f r o m 4 b y t e s t o 7 b y t e s .
* /
movl ( % r s i ) , % e c x
movl - 4 ( % r s i , % r d x ) , % r8 d
movl % e c x , ( % r d i )
movl % r8 d , - 4 ( % r d i , % r d x )
retq
2006-02-03 23:51:02 +03:00
.p2align 4
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
.Lless_3bytes :
cmpl $ 0 , % e d x
je . L e n d
/ *
* Move d a t a f r o m 1 b y t e s t o 3 b y t e s .
* /
2006-02-03 23:51:02 +03:00
.Lloop_1 :
2009-03-12 14:20:17 +03:00
movb ( % r s i ) , % r8 b
movb % r8 b , ( % r d i )
2006-02-03 23:51:02 +03:00
incq % r d i
incq % r s i
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
decl % e d x
2006-02-03 23:51:02 +03:00
jnz . L l o o p _ 1
2009-03-12 14:20:17 +03:00
.Lend :
x86, mem: Optimize memcpy by avoiding memory false dependece
All read operations after allocation stage can run speculatively,
all write operation will run in program order, and if addresses are
different read may run before older write operation, otherwise wait
until write commit. However CPU don't check each address bit,
so read could fail to recognize different address even they
are in different page.For example if rsi is 0xf004, rdi is 0xe008,
in following operation there will generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and rdi were in really the same meory page, there are TRUE
read-after-write dependence because instruction 2 write 0x008 and
instruction 3 read 0x00c, the two address are overlap partially.
Actually there are in different page and no any issues,
but without checking each address bit CPU could think they are
in the same page, and instruction 3 have to wait for instruction 2
to write data into cache from write buffer, then load data from cache,
the cost time read spent is equal to mfence instruction. We may avoid it by
tuning operation sequence as follow.
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 read 0x004, instruction 2 write address 0x010, no any
dependence. At last on Core2 we gain 1.83x speedup compared with
original instruction sequence. In this patch we first handle small
size(less 20bytes), then jump to different copy mode. Based on our
micro-benchmark small bytes from 1 to 127 bytes, we got up to 2X
improvement, and up to 1.5X improvement for 1024 bytes on Corei7. (We
use our micro-benchmark, and will do further test according to your
requirment)
Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-06-28 23:24:25 +04:00
retq
2006-09-26 12:52:32 +04:00
CFI_ E N D P R O C
ENDPROC( m e m c p y )
ENDPROC( _ _ m e m c p y )
2006-02-03 23:51:02 +03:00
2009-03-12 14:20:17 +03:00
/ *
2011-05-18 02:29:16 +04:00
* Some C P U s a r e a d d i n g e n h a n c e d R E P M O V S B / S T O S B f e a t u r e
* If t h e f e a t u r e i s s u p p o r t e d , m e m c p y _ c _ e ( ) i s t h e f i r s t c h o i c e .
* If e n h a n c e d r e p m o v s b c o p y i s n o t a v a i l a b l e , u s e f a s t s t r i n g c o p y
* memcpy_ c ( ) w h e n p o s s i b l e . T h i s i s f a s t e r a n d c o d e i s s i m p l e r t h a n
* original m e m c p y ( ) .
* Otherwise, o r i g i n a l m e m c p y ( ) i s u s e d .
* In . a l t i n s t r u c t i o n s s e c t i o n , E R M S f e a t u r e i s p l a c e d a f t e r R E G _ G O O D
* feature t o i m p l e m e n t t h e r i g h t p a t c h o r d e r .
*
2009-03-12 14:20:17 +03:00
* Replace o n l y b e g i n n i n g , m e m c p y i s u s e d t o a p p l y a l t e r n a t i v e s ,
* so i t i s s i l l y t o o v e r w r i t e i t s e l f w i t h n o p s - r e b o o t i s t h e
* only o u t c o m e . . .
* /
2011-05-18 02:29:16 +04:00
.section .altinstructions , " a"
altinstruction_ e n t r y m e m c p y ,. L m e m c p y _ c ,X 8 6 _ F E A T U R E _ R E P _ G O O D ,\
.Lmemcpy_e - .Lmemcpy_c , .Lmemcpy_e - .Lmemcpy_c
altinstruction_ e n t r y m e m c p y ,. L m e m c p y _ c _ e ,X 8 6 _ F E A T U R E _ E R M S , \
.Lmemcpy_e_e - .Lmemcpy_c_e , .Lmemcpy_e_e - .Lmemcpy_c_e
2006-02-03 23:51:02 +03:00
.previous