00:15:25 we don't allocate the initial stack.
00:15:41 the problem appears to be with protection
00:16:00 looks like it protects the wrong thing
00:18:57 though the math checks out
00:19:06 apparently
01:21:18 -!- scymtym_ [~user@ip-5-147-122-209.unitymediagroup.de] has quit [Ping timeout: 276 seconds]
01:53:56 could there be a problem with pthread_attr_setstack, "All pages within the stack described by stackaddr and stacksize shall be both readable and writable by the thread.", but we're putting the guards inside that space
01:54:43 i changed it to thread_control_stack_size - os_vm_page_size * 2, and it doesn't seem to be crashing anymore
01:57:07 i also have the binding guards disabled, need to check with them intact
02:07:45 it does indeed solve the problem, weird
02:24:54 I guess it *could* be some TLS thing...
02:27:14 why would it be triggered by alloca?
02:28:36 strange, without this - os_vm_page_size * 2 proc/maps shows the guard pages to be above [stack:45673], but with it the guard page is below
02:39:42 -!- christoph_debian [~christoph@ppp-188-174-146-202.dynamic.mnet-online.de] has quit [Ping timeout: 264 seconds]
02:46:05 non-working sb-dynamic-core prevents faster turn around
02:48:52 on x86, /proc/maps correctly identifies the relative location of page guards
02:52:14 pranavrc [~pranavrc@122.164.227.252] has joined #sbcl
02:52:14 -!- pranavrc [~pranavrc@122.164.227.252] has quit [Changing host]
02:52:14 pranavrc [~pranavrc@unaffiliated/pranavrc] has joined #sbcl
02:52:48 christoph_debian [~christoph@ppp-93-104-181-102.dynamic.mnet-online.de] has joined #sbcl
03:07:14 the assembly output of alloca(65536) doesn't suggest anything out of the ordinary
03:15:01 kanru [~kanru@118-163-10-190.HINET-IP.hinet.net] has joined #sbcl
03:35:16 -!- stassats` [~stassats@wikipedia/stassats] has quit [Ping timeout: 264 seconds]
04:43:21 Bike_ [~Glossina@wl-nat100.it.wsu.edu] has joined #sbcl
04:43:26 -!- Bike [~Glossina@wl-nat100.it.wsu.edu] has quit [Ping timeout: 264 seconds]
04:54:26 -!- Bike_ is now known as Bike
05:24:43 -!- pranavrc [~pranavrc@unaffiliated/pranavrc] has quit [Ping timeout: 264 seconds]
05:34:21 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
06:08:45 -!- easye [~user@213.33.70.157] has quit [Read error: No route to host]
07:42:24 echo-area [~user@123.120.251.36] has joined #sbcl
08:13:25 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 276 seconds]
08:25:29 -!- edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has quit [Quit: lifetime expired because experience stopped]
08:40:19 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
08:41:46 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
08:42:21 -!- jdz [~jdz@85.254.212.34] has quit [Quit: Byebye.]
08:46:06 jdz [~jdz@85.254.212.34] has joined #sbcl
08:48:46 -!- jdz [~jdz@85.254.212.34] has quit [Client Quit]
08:49:26 jdz [~jdz@85.254.212.34] has joined #sbcl
08:55:15 attila_lendvai [~attila_le@unaffiliated/attila-lendvai/x-3126965] has joined #sbcl
09:14:51 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
09:19:02 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
11:08:31 yacks [~py@103.6.159.100] has joined #sbcl
11:12:51 -!- milosn_ [~milosn@user-5af507e0.broadband.tesco.net] has quit [Read error: Operation timed out]
11:18:29 milosn [~milosn@user-5af507e0.broadband.tesco.net] has joined #sbcl
11:42:25 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
11:46:07 -!- ASau [~user@p4FF96CCC.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
12:12:06 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 264 seconds]
12:19:55 yacks [~py@103.6.159.100] has joined #sbcl
12:31:35 segv- [~mb@95-91-243-229-dynip.superkabel.de] has joined #sbcl
12:48:11 -!- echo-area [~user@123.120.251.36] has quit [Read error: Connection reset by peer]
12:56:53 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 240 seconds]
13:06:10 yacks [~py@103.6.159.100] has joined #sbcl
13:16:21 -!- psilord2 [~psilord@c-69-180-173-249.hsd1.mn.comcast.net] has quit [Quit: Leaving.]
13:16:55 edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has joined #sbcl
13:39:52 nyef [~nyef@pool-70-109-145-126.cncdnh.east.myfairpoint.net] has joined #sbcl
14:00:01 you know, with a page size of 64k, maybe we just end up allocating a tiny tiny stack.
14:10:47 -!- foom [jknight@nat/google/x-zxkmfprfmvinmljn] has quit [Ping timeout: 240 seconds]
14:11:13 foom [jknight@nat/google/x-vstwaceowsubvllm] has joined #sbcl
14:26:28 stassats` [~stassats@wikipedia/stassats] has joined #sbcl
14:31:07 pkhuong: it's 512K by default and even passing different values with --control-stack-size doesn't help
14:31:33 but low stack doesn't explain why setting pthread_attr_setstack two pages less is fixing this
14:32:53 another thing is that we have an unprotected whole after the guard, since it doesn't set a hard guard for threads, so it's
14:33:04 s/whole/hole/
14:34:40 and i'm not sure why don't do that, the main thread stack has a hard guard
14:39:01 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 248 seconds]
14:40:10 -!- attila_lendvai [~attila_le@unaffiliated/attila-lendvai/x-3126965] has quit [Quit: Leaving.]
14:43:29 yacks [~py@103.6.158.105] has joined #sbcl
14:44:21 easye [~user@213.33.70.157] has joined #sbcl
14:55:50 setting a hard guard doesn't help, even more, some contribs fail to build
14:57:06 which is a bit odd
15:07:34 huh, setting stack: start: 0xf73c0000, end: 0xf7bc0000, size: 2097152 and then protecting page 0xf75a0000
15:08:33 control_stack_end seems to be completely wrong
15:10:07 and binding stack also ends up inside the control stack
15:13:44 Are the thread stacks set up correctly for a stacks-grow-upward regime?
15:14:09 the direction appears to be correct
15:14:37 Direction, starting position, and guard page location?
15:14:47 Hrm.
15:15:03 guard page location is weird, it appears to be in the middle, see the above print-out
15:15:20 maybe the binding stack and alien stack are designed to be inside the control stack
15:15:27 No, they're not.
15:15:58 Well, the number stack is supposed to be the alien stack, but...
15:16:35 thread_control_stack_size is 2097152, but (- sb-vm:*control-stack-end* sb-vm:*control-stack-start*) is 524288
15:17:10 n-fixnum-tag-bits?
15:19:00 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
15:19:00 Yeah, that's the fact that the static symbols have aligned untagged values.
15:19:32 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
15:21:44 before setting the stack th->control_stack_start+thread_control_stack_size => 0xf7510000, but th->control_stack_end is 0xf6f10000
15:22:41 can (lispobj*)((void*)th->control_stack_start+thread_control_stack_size); be a problem?
15:22:58 that's basically how th->control_stack_end is calculated
15:23:09 in particular, the casts
15:26:22 we do arithmetic on a void*?
15:27:13 ah, gcc extension.
15:32:33 and lispobj is typedef unsigned int lispobj;
15:37:47 pkhuong: that example isn't void* arithmetic though. -> binds tighter than the cast
15:44:27 (lispobj*)((void*)th->control_stack_start+thread_control_stack_size) gives 0xf6f50000, but th->control_stack_start+thread_control_stack_size gives 0xf7550000
15:44:44 so, something is indeed wrong with the casts
15:46:07 weird. You sure?
15:46:35 -> definitely binds tighter than typecast, and as such there should be no difference
15:46:48 loke_: http://paste.lisp.org/display/138422
15:46:55 i'm not sure about anything anymore
15:50:27 w t f
15:51:18 what I see in your output would make sense if the typecast bound tighter (since the GCC extension allows void* arithmetic with an assumed size of 1)
15:51:36 oh wait a second
15:51:38 what does bind and tight mean?
15:51:49 it does bind tighter than ->, but not tighter than +
15:52:08 so it's equivalent to ((void*)th->control_stack_start)+thread_control_stack_size
15:52:27 It means higher precedence
15:52:39 that's what i thought
15:52:49 a+b*c, b and c are bound by the *, tighter than the a+b
15:53:03 i expected it to be ((void*)th->control_stack_start)
15:53:12 stassats`: It's a bit ugly, and at the very least it should cast to char*, not void* unless you really _want_ to be GCC specific
15:53:26 control_stack_start is unsigned int *
15:53:53 stassats`: right, but is _size in bytes?
15:54:24 size is in bytes, right
15:54:31 does it think it's in words?
15:54:41 and i don't want to be gcc specific, clang is a supported compiler too
15:59:42 stassats`: if the code works the way it is, then just change the cast to char* instead of void*, and you'll be standards compliant.
16:00:08 GCC just falls back to size=1 (i.e. char) if you try to do arithmetic on void*
16:00:13 stupid, in my opinion. :-)
16:00:26 and that's not the only place where that happens
16:01:55 but that's compiled with gcc, why does it break then?
16:02:37 I'm not sure that the snippet does anything wrong. control stack size is measure in bytes, right?
16:02:40 *measured
16:03:23 yes
16:05:56 oh, i see, th->control_stack_start+thread_control_stack_size is actually what is wrong
16:06:01 without casts
16:08:08 but that's just debug output, so it does not explain why anything is failing
16:09:22 though (+ #xf6f50000 2097152) still gives #xF7150000, not 0xf6f50000
16:10:39 I don't see how (+ #xf6f50000 2097152) could give 0xf6f50000
16:11:47 copied the wrong thing, #xf6d50000, and it's correct
16:11:51 so, back to square one
16:13:18 so, everything is correct, except that it starts to work when i pass thread_control_stack_size-os_vm_page_size*2 to pthread_attr_setstack instead of just thread_control_stack_size
16:20:54 and alloca(65536) should go near to the page guard
16:21:10 i'd opt in for bad pointer arithmetic instead of this conundrum
16:23:11 time to build a C program trying to replicate this
16:26:35 -!- ASau` is now known as ASau
16:34:14 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
16:37:35 stassats`: unlikely, but protecting both guard pages in init_new_threads might cause another, easier to understand, failure.
16:38:26 by both you mean hard and "soft"?
16:38:58 protecting the hard page did indeed cause sb-concurrency sb-introspect and sb-bsd-sockets to fail to build, but i didn't look any further than that
16:39:43 i'm building a C program which creates a thread with a custom stack and similar page guards, hopefully it can be easier to debug
16:52:02 -!- segv- [~mb@95-91-243-229-dynip.superkabel.de] has quit [Remote host closed the connection]
16:55:04 another interesting thing, we try to align stack space by page size, by rounding the start up
16:55:19 but, after the rounding, the size should grow down, which it doesn't
16:55:43 which can explain why subtracting os_vm_page_size*2 from thread_control_stack_size helps
16:56:32 I think there's a comment that says some struct's size is pre-padded.
16:57:04 yeah, even if it weren't, it allocates much more than just control stack size, so that wouldn't matter
16:57:27 pranavrc [~pranavrc@122.164.235.4] has joined #sbcl
16:57:28 -!- pranavrc [~pranavrc@122.164.235.4] has quit [Changing host]
16:57:28 pranavrc [~pranavrc@unaffiliated/pranavrc] has joined #sbcl
16:58:41 segv- [~mb@95-91-243-229-dynip.superkabel.de] has joined #sbcl
16:59:53 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:00:16 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:02:20 Odyessus [~odyessus@213.47.71.36] has joined #sbcl
17:03:38 stassats`: also, c-stack isn't control stack on ppc.
17:05:21 ... Right, number (alien) stack is control stack on PPC, isn't it?
17:07:01 Err... is c-stack, not control stack. Oops.
17:09:56 what is c-stack?
17:10:26 The control (and data) stack for C code.
17:10:45 As opposed to the control (and boxed data) stack for Lisp code.
17:11:57 -!- Odyessus [~odyessus@213.47.71.36] has quit [Quit: Colloquy for iPad - http://colloquy.mobi]
17:14:18 an address on the stack is printed as 0xf762eb10, with the control stack being 0xf7430000-0xf7630000, and the page guard being at 0xf7610000, binding stack starting from 0xf7630000
17:14:21 something is weird
17:17:53 -!- luis [~luis@kerno.org] has quit [Quit: ZNC - http://znc.sourceforge.net]
17:18:43 Address of what?
17:18:49 And what's the number stack range?
17:19:36 0xf762eb10 is printf("%p", alloca(10));
17:19:59 and number stack being?
17:20:22 Is this for SBCL or a separate test program?
17:20:38 that's from a shared library loaded by sbcl
17:20:52 Okay, so the number stack or alien stack range?
17:21:30 there's no such thing as number_stack_start, which stack would contain it?
17:23:22 Might be alien_stack_start in struct thread.
17:23:43 alien_stack_guard_page is at 0xf7810000
17:24:00 (that's what i have printed right now)
17:25:19 Oh, god.
17:26:04 Is attach_os_thread() using pthread_attr_getstack() and setting the control stack to that value?
17:26:14 and binding guard is at 0xf74a0000, with binding stack starting from the end of the control stack
17:26:48 And create_os_thread using pthread_attr_setstack() with the control_stack parameters?
17:27:25 Mea culpa, mea culpa, maxima mea culpa, this is WRONG, WRONG, WRONG, and my fault for not seeing it when I did the original PPC thread conversion.
17:28:52 what does attach_os_thread do?
17:29:31 Looks like it creates a struct thread for an existing thread that doesn't have a corresponding Lisp thread structure.
17:29:56 Probably for call-into-lisp from a thread created by an external library.
17:30:40 that's what i thought
17:31:28 so, where do you say the problem lies?
17:32:00 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:32:06 When using pthread_attr_getstack() and pthread_attr_setstack(), it should be control_stack on x86oids only, and alien_stack elsewhere.
17:32:43 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:33:00 Actually, scratch that.
17:33:18 The condition is control_stack on c_stack_is_control_stack systems, and alien_stack elsewhere.
17:33:45 ok, that's how i read it
17:33:48 let me try that
17:35:16 Clearly, I need to desk-check the thread logic and PPC backend at some point, just to be sure. I can see HOW I missed this, but I can't see that it was particularly excusable given how much other thread damage there was in the runtime.
17:36:36 i also need to clear up things with memory barriers after allocations, just using sync may be too heavy-weight
17:36:57 (without sync it crashes and burns)
17:37:07 -!- slyrus [~chatzilla@107.200.11.156] has quit [Ping timeout: 264 seconds]
17:37:23 Look again: PPC memory barriers basically ARE all sync.
17:37:31 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:38:00 there are different options to sync, we just need that later writes are not seen earlier than the earlier writes
17:38:13 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:38:52 stassats`: after PA, regardless of allocation, actually.
17:39:19 slyrus [~chatzilla@107.200.11.156] has joined #sbcl
17:39:47 can that also happen outside of it? or should we just delegate that to the user?
17:40:02 Probably should check to make sure there are barriers in the WITHOUT-GCING and WITHOUT-INTERRUPTS sequences as well.
17:40:52 For system consistency, you probably need barriers whenever you do something "against GC rules", which is to say whenever you do something that has to be atomic with respect to GC.
17:41:19 Outside of that, it's the user's responsibility to ensure consistency between threads in their own code.
17:41:30 makes sense
17:41:38 (of course, the user might be ourselves)
17:44:08 ok, changed it to the alien stack, but alien_stack_guard_page => 0xf7680000, and the stack is 0xf769eb10
17:44:16 i.e., after the guard
17:44:30 is this a downward/upward problem?
17:44:58 I scanned the runtime for that an hour ago, and nothing jumped out.
17:45:35 needless to say, getaddrinfo still doesn't work
17:46:03 -!- pranavrc [~pranavrc@unaffiliated/pranavrc] has quit [Ping timeout: 276 seconds]
17:46:14 but and now i get INFO: Alien stack guard page unprotected, instead of the control stack
17:46:28 Yeah, the alien stack should be growing upwards.
17:47:54 setting of th->alien_stack_pointer seems to be correctly protected by LISP_FEATURE_STACK_GROWS_DOWNWARD_NOT_UPWARD
17:48:43 hah, but the guard page is wrong
17:49:01 #define ALIEN_STACK_HARD_GUARD_PAGE(th) (((os_vm_address_t)th->alien_stack_start) + ALIEN_STACK_SIZE - \ os_vm_page_size)
17:49:09 wait, it's
17:49:18 didn't notice the + ALIEN_STACK_SIZE part
17:53:14 pnpuff [~Op125@unaffiliated/pnpuff] has joined #sbcl
17:53:41 -!- pnpuff [~Op125@unaffiliated/pnpuff] has left #sbcl
17:54:28 and on the main thread, the stack address is not inside the alien stack, but on the default C stack
17:54:53 I don't think I see where we switch stack in call_into_c.
17:55:32 so, it uses the one provided by pthread_attr_setstack and the default one?
17:57:51 On PPC? We don't switch stack, Lisp uses a different stack pointer than C does.
17:58:07 (Specifically, you should find that the Lisp reg_NSP is the C stack pointer.)
17:59:16 ah, ok.
18:05:25 *stassats`* goes to sprinkle some more print statements
18:48:01 alien start: 0xf7370000, end: 0xf7470000, but /proc/maps says f7460000-f7720000 rwxp 00000000 00:00 0 [stack:48950]
18:49:17 and the address of a stack variable is 0xf746eae0, which confirms what /proc/maps says
18:55:23 what does that snippet from /proc/maps hint at?
18:56:26 that the end and the start are mixed up
18:57:32 http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_attr_setstack.c;h=4bd314e66aefb6268b7757712eec87e0c9a334e3;hb=HEAD#l31 is a bit strange
18:57:46 rpg [~rpg@216.243.156.16.real-time.com] has joined #sbcl
18:58:04 it sets the address to stackaddr + stacksize, which is consistent with what is seen at /proc/maps, but not with what is needed
18:58:22 "stackaddr should point to the lowest addressable byte"
18:59:00 which may as well work on grows-downward
18:59:19 no.
18:59:57 I see what you mean.
19:01:39 but the math doesn't add up, it would start at 0xf7470000 if that were the case
19:02:25 well. easy workaround: call setstack{addr,size} separately.
19:03:07 addr is deprecated
19:04:04 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Ping timeout: 264 seconds]
19:04:17 yeah, well it doesn't do the fixup magic.
19:04:21 i find it hard to believe that glibc is doing something wrong, i must be missing something
19:04:47 easy way to test that hypothesis: call setstackaddr directly.
19:05:34 http://sourceware.org/git/?p=glibc.git;a=commit;f=nptl/pthread_attr_setstack.c;h=76a50749f7af5935ba3739e815aa6a16ae4440d1 says the fixup is intentional
19:10:00 so, pehrps /proc/maps is wrong
19:10:05 perhaps
19:13:40 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
19:14:11 stassats`: I guess a simple test program would create a new thread and print sp.
19:20:49 loke_ [~elias@2001:470:36:b4a:589f:e2d2:cf5a:264f] has joined #sbcl
19:26:42 ASau [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
19:28:33 in my c program, it appears to be below the start of the stack
19:29:38 http://paste.lisp.org/display/138422#1
19:36:03 stassats`: and what setstackaddr?
19:37:14 blimey, i commented out &attr and put NULL instead
19:37:26 disregard that output
19:38:36 ok, with &attr i get stack start: 0x3fff876b0000, end: 0x3fff878b0000 stack = 0x3fff8788e710, and if i enable mprotect, it segfaults
19:38:42 just like in sbcl
19:43:02 ok, and if you setstackaddr instead?
19:43:24 i get a segmentation fault without mprotect
19:43:53 it seems setstacksize doesn't have any effect in this case
19:44:30 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
19:46:31 -!- edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has quit [Quit: experience lost because computer sucks]
19:48:19 sdemarre [~serge@70.122-64-87.adsl-dyn.isp.belgacom.be] has joined #sbcl
19:50:21 ASau [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
19:53:08 and again, specifying THREAD_SIZE-page_size fixes that
19:53:51 what a peculiar situation, i have a fix, but have no idea why it works
19:54:56 luis` [~luis@kerno.org] has joined #sbcl
19:55:07 i mean, fixes that crash caused by mprotect
19:58:24 And the problem is, if you don't know why it works, you don't know that it doesn't cause other problems down the line.
19:58:31 exactly
19:59:03 however, my example seems to crash during allocate_stack of pthreads
19:59:06 which does /* The user provided stack memory needs to be cleared. */ memset (pd, '\0', sizeof (struct pthread));
19:59:33 pd defined as pd = (struct pthread *) ((uintptr_t) attr->stackaddr - TLS_TCB_SIZE - adj);
19:59:39 so, it store something at the end of the stack
19:59:41 stores
20:00:05 maybe that's why protecting it breaks, and does not break on x86, because it protects the lower part
20:00:42 that would explain passing the lower thread size, but in my test program it happens directly in pthread_create, while sbcl crashes only after alloca
20:00:48 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Remote host closed the connection]
20:03:16 this bug is fun
20:07:30 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
20:08:55 now i'm completely baffled, i've got stack growing downwards on ppc for me
20:13:14 http://www.ibm.com/developerworks/linux/library/l-powasm4/index.html says "stack grows downward"
20:13:21 whaaaa
20:17:31 glibc: /* On PPC the stack grows down. */ #define _STACK_GROWS_DOWN 1
20:17:31
20:17:34 i demand a refund
20:17:48 ... Okay, what does SB-ALIEN and number-stack do on PPC?
20:18:11 SB-ALIEN would be src/compiler/target/ppc/c-call.lisp.
20:18:17 Err...
20:18:24 src/compiler/ppc/c-call.lisp.
20:18:30 do as in?
20:18:46 There should be a VOP for allocating stack space, does it go up or down?
20:19:15 There should also be a VOP somewhere used for allocating number stack space, possibly in ppc/call.lisp, does THAT go up or down?
20:20:43 Less definitive but still strongly indicative, in ppc/insts.lisp, are TNs with a number-stack SB emitted with a positive or a negative offset, and from what register?
20:22:05 I think machine code is really agnostic about this (calls work via the link register, right?)
20:22:19 so we might grow upward in lisp when C grows downward :\
20:23:04 and the problem manifests itself when we set up the guard pages and the like with the idea that it grows upward
20:23:10 this is hilarious
20:23:13 Yeah, but if SBCL is doing one thing with the number stack while C does another then we have a systemic problem, and plausibly have to define separate direction conditionals for each stack.
20:23:30 somewhere, Steve Jobs is laughing at us
20:24:15 Krystof: Isn't it, though? And yet, it's still indicative of a failure in our development practice.
20:24:49 well now. That's assuming we have a development practice :-)
20:25:03 he knew all along to abandon PPC
20:25:10 I would describe it more as a series of uncoordinated events
20:25:21 yeah. stack grows upward on PPC/Lisp
20:25:41 stassats`: actually, the problem manifests itself as C code randomly overwriting our number stack :|
20:26:45 Mmm. C code screwing our number stack, and then number stack usage screwing C stack frames...
20:26:57 LOADW is (lwz object base (- byte-offset lowtag)), and load-stack-tn is (loadw reg cfp-tn offset)
20:27:12 how come i'm the first one to notice that?
20:27:18 And is LOAD-STACK-TN used for number-stack TNs?
20:27:46 no. but it's a copy paste
20:28:01 And is the NFP for a frame taken before or after adjusting the NSP?
20:28:11 I'm thinking probably before, but I'm not certain.
20:28:22 (not sure if the easiest fix is to make the number stack grow downward or make everything grow downward ;)
20:28:38 wanna best stack grows upward on darwin or some such?
20:29:01 no bet
20:29:05 I don't want to see the easiest fix, I want to see the fix that gives us the most flexibility to adapt to crazy stack regimes going forward.
20:29:12 s/best/bet/
20:29:29 #!+stack-grows-both-upwards-and-downwards-and-maybe-sideways
20:29:32 see, I don't want crazy stack regimes ;)
20:30:07 #!+heap-allocated-number-stack-frames ?
20:31:05 nfp grows down.
20:31:08 oh dear.
20:31:15 i predicted that it would be something stupid, but on the third day i was convinced that it couldn't be, the fifth day eventually confirmed that it is indeed something stupid
20:31:48 always looking at it with the assumption that it grows upward made finding it quite difficult
20:31:52 "compute-old-nfp does (inst addi result nfp [number-stack-frame-size])
20:31:52 NFP on PPC needs to grow down.
20:32:21 yeah. ok. so it's actually correct, it's "just" our page protection logic that's messed up?
20:32:26 Is it just that the guard pages are located stupidly, then?
20:33:57 do we use the alien stack to allocate things for with-alien?
20:34:45 stassats`: look on the bright side. On the sixth day, it's fixed, and you can rest on the seventh day
20:35:10 *nyef* applauds.
20:35:13 the stack grows downward on os x ppc
20:35:59 So, proposal: Rename CONTROL-STACK-GROWS-DOWNWARD-NOT-UPWARD to CONTROL-STACK-GROWS-DOWNWARD (one commit). Add ALIEN-STACK-GROWS-DOWNWARD, defined on PPC and x86oids (one commit). Use ALIEN-STACK-GROWS-DOWNWARD to correctly set the alien-stack guard pages (one commit).
20:36:54 and change pthread_create to use alien-stack on non-alien-stack-control-stacks
20:36:59 And sorting out the issue with setting the C stack (alien stack) to overlap the control stack should be a separate commit as well, yes.
20:37:18 Possibly two commits, one for attach and one for create.
20:37:50 Does the initial thread go via attach, or is there a third place to worry about?
20:39:08 does alien stack grow downward only on x86oids and ppc?
20:39:25 nyef: also, I think this means unboxed DX allocation is extra hard on PPC.
20:40:00 pkhuong: I wouldn't be surprised... do we do unboxed DX allocation on PPC yet?
20:40:17 If so, it needs to be desk-checked, and possibly test-cases written.
20:40:30 I don't think so. stack grow downward, but we address TN by adding an offset from NFP, so tricks with NSP are looking pretty hairy.
20:40:54 By offset from NFP, so tricks with NSP are looking pretty safe?
20:41:21 the offset is in the wrong direction though, so we can't easily restore NSP.
20:41:40 i.e. NFP isn't anything like ONSP.
20:42:03 on SPARC, stack grows downwards as well, at least that's how glibc defines it
20:44:01 -!- ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 256 seconds]
20:46:25 and i guess because it grew downward, and lisp stack grew upward, it worked with the both stacks being inside the control stack
20:46:53 fsvo "worked"
20:47:22 nyef: but I think we're actually good there. We end up doing nsp <- nfp + frame_size.
20:54:48 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
20:54:53 what's the benefit of the control stack growing upward?
20:55:08 when the c stack grows the opposite direction
20:55:22 marginally easier to think about.
20:56:42 too bad mips and sparc machines of the gcc compile farm are down currently
20:59:51 ehaliewicz [~user@50-0-51-11.dsl.static.sonic.net] has joined #sbcl
20:59:59 the alloca test now passes
21:00:41 and slime with a dedicated output stream (a.k.a. the addrinfo test)
21:00:52 incredible.
21:01:09 that and barrier, and ppc is almost usable?
21:01:15 right
21:03:26 frlock.1 test of sb-concurrency still fails intermittently
21:05:26 how does it fail? timeout?
21:07:07 wrong expected values
21:07:38 looks like an ordering issue
21:09:35 what value do we receive?
21:10:10 -!- rpg [~rpg@216.243.156.16.real-time.com] has quit [Quit: rpg]
21:11:13 r-e! is non-nil
21:11:47 oh. I see.
21:11:53 frlock are borked.
21:12:11 unless (eql c (+ a b)) fails
21:12:39 frlock-read-begin should loop while a writer holds the write mutex.
21:14:03 no. it does something with two counters instead.
21:17:48 stassats`: in release-frlock-write-lock, can you move the :write barrier to the beginning of the function?
21:19:08 doesn't help
21:20:34 and move the read barrier to the end of frlock-read-begin (+ m-v-prog1)
21:20:59 erh, or just wrap the values form in barrier
21:22:28 the pairing was all wrong.
21:22:59 no failures in 10 runs so far
21:24:46 rpg [~rpg@c-71-63-247-69.hsd1.mn.comcast.net] has joined #sbcl
21:28:33 threads.impure still seems to be failing sometimes
21:28:38 even with the barrier
21:29:05 any test in particular?
21:29:32 looks like the same subtypep one
21:30:01 "subtypep one"?
21:30:23 calling subtypep from different threads, causing to access its cache
21:30:30 the one i pasted earlier
21:32:59 there is no way our cache code is thread safe.
21:33:32 if it ever worked, that was an accident.
21:35:07 ok. neer a barrier around ,@(sets) in early-extensions.lisp:L619
21:35:19 *needs
21:35:49 and i can't trigger it with my earlier test anymore
21:35:53 it is all right on x86, obv.
21:36:08 and running threads.impure alone doesn't seem to trigger it
21:37:57 and a read barrier around (,args-and-values (svref ,n-cache ,n-index)) as well
21:38:13 ok. Now I'm almost tempted to resurrect alpha ;)
21:38:46 ppc is not enough of a challenge?
21:38:47 Sorry, I couldn't find anything for is not enough of a challenge?.
21:41:02 -!- sdemarre [~serge@70.122-64-87.adsl-dyn.isp.belgacom.be] has quit [Ping timeout: 264 seconds]
21:43:17 and around (setq ,var-name (make-array ,size :initial-element 0)) (L617), if we ever stop 0-initialising the heap.
21:47:28 scymtym_ [~user@ip-5-147-122-209.unitymediagroup.de] has joined #sbcl
21:48:09 (well, around make-array.)
21:51:36 now i got into ldb after :HASH-TABLE-PARALLEL-READERS
21:51:46 and before :hash-table-single-accessor-parallel-gc
21:55:03 so, *in* :hash-table-parallel-readers ?
21:55:29 it manages to print success running :Hash-Table-Parallel-Readers, but not "running :hash-table-single-accessor-parallel-gc
21:55:34 so, somewhere in between
21:58:18 what kind of ldb?
21:58:28 a non-working one
21:58:54 sure, but what was the cause?
21:59:13 it seemed like two ldb prompts popped out, presumably from two threads, but i couldn't enter anything
21:59:24 so, no idea
22:03:02 on one hand, I'm tempted to convert that code to joinable threads and a flag. on the other, it shouldn't crash this bad.
22:16:59 -!- psilord [~pkeller@23-25-144-217-static.hfc.comcastbusiness.net] has quit [Quit: Leaving.]
22:29:30 now it finished ok
22:32:42 maybe my printfs confused it somewhat, running again
22:49:43 -!- slyrus [~chatzilla@107.200.11.156] has quit [Ping timeout: 264 seconds]
22:53:42 slyrus [~chatzilla@107.200.11.156] has joined #sbcl
22:53:57 no failures this time either
22:54:04 ok, i'll blame printfs
23:01:26 -!- rpg [~rpg@c-71-63-247-69.hsd1.mn.comcast.net] has quit [Quit: rpg]
23:04:45 i think attach_os_thread doesn't work on (not c-stack-is-control-stack)
23:05:15 psilord [~psilord@c-69-180-173-249.hsd1.mn.comcast.net] has joined #sbcl
23:06:55 so, why not make c-stack-is-control-stack everywhere?
23:11:06 -!- segv- [~mb@95-91-243-229-dynip.superkabel.de] has quit [Remote host closed the connection]
23:13:01 it'll mess with precise GC
23:24:15 then attach_os_thread should create a new control-stack?
23:24:49 probably.
23:25:17 but that's only used for foreign callbacks from foreign threads (and so only on sb-safepoint builds), right?
23:25:23 right
23:25:44 3-for-3 successful complete ./run-tests.sh
23:27:24 i, of course, don't plan to run anything on ppc, so i'm not really worried about the lack of sb-safepoint, but i would care about it working on ARM
23:31:50 even parallel subtypep?
23:32:19 even subtypep
23:32:57 the only difference between these runs and the failing ones is that i was printing stuff on pthread_create and on protect_page_guard
23:33:17 perhaps concurrent access to stdio was problematic
23:33:57 subtypep is still wrong :|
23:34:49 It could randomly return 0, 0.
23:35:25 i think i've seen that failure mode
23:39:16 but i can't manage to trigger it
23:40:23 sometimes I wonder how hard it would be to adapt dthreads to sbcl ;)
23:40:54 invariably, my conclusion is "hard."
23:41:27 not like getting the current threading to work right is easy
23:42:39 it's not research, at least.
23:43:36 now that alien stack is guarded in the correct place, perhaps it'll make sense to enable hard page guards too?
23:44:00 true.
23:44:50 previously, the space for hard page guards was the alien stack
23:45:28 it was also the number stack.
23:46:04 but the number stack grew upward and with the soft page guard in the middle
23:46:22 perhaps that's why nobody noticed it, or haven't bothered reporting it
23:46:43 probably no one had deeply recursive numeric functions
23:47:16 corollary: 64K number stacks suffice.
23:47:17 or just nobody runs on ppc
23:47:57 well, there's bluegene, but that's with an unthreaded build.
23:48:04 64K? i think the number stack was larger, the c stack was 64K
23:48:24 c stack is number stack.
23:48:58 why is it called "number"?
23:49:06 that's where we punt unboxed numbers
23:49:08 *put
23:49:09 and why is it called so and none of the actual stacks are named so
23:50:08 so, binding is for special variables, control stack is function frames and alien stack is for c function frames?
23:50:24 except where control stack and alien stack are the same
23:50:34 right.
23:50:53 c function frames, or other stuff we don't want the GC to see.
23:52:40 and are these things put on the alien stack in the correct direction?
23:53:00 well, on PPC at least, number stack frames are allocated in the right direction
23:53:15 they're just addressed backward... but that's not *too* bad.
23:54:04 i need to conjure up some tests for this
23:54:11 alien stack IS number stack.
23:54:40 It's only called "alien stack" on x86oids, where numbers are stored on the control stack.
23:54:42 doing alloca(128K) should do for the current mix-up
23:55:14 also, old school advantage of stack grows downward (potentially why that choice is so popular): you can brk upward to have both heap and stack without statically allocating anything.
23:55:14 And then, to make things worse, alien code is run on the control stack anyway, so the alien stack is only used for stack-allocated alien data.
23:55:46 nyef: I'm surprised it's used at all on x86
23:55:55 A plausible choice, but brk()en on 32 bit systems, really.
23:55:57 nyef: that got me quite confused when i looked at it
23:56:31 ARM, IIRC, has alien-stack rather than number-stack, but a precisely-scavenged Lisp control stack.
23:56:42 On the other paw, ARM doesn't yet get to the point of GC.
23:58:05 no wonder it was mixed up in the first place
23:58:28 Pretty much, yeah.
23:59:18 Hence why it was so easy to miss when I did the original PPC threads bit, there were so many OBVIOUS breakages when it came to threads in the runtime that something subtle like the stacks being completely screwed up slipped entirely under the radar.