00:15:25 <pkhuong> we don't allocate the initial stack.
00:15:41 <stassats`> the problem appears to be with protection
00:16:00 <stassats`> looks like it protects the wrong thing
00:18:57 <stassats`> though the math checks out
00:19:06 <stassats`> apparently
01:21:18 -!- scymtym_ [~user@ip-5-147-122-209.unitymediagroup.de] has quit [Ping timeout: 276 seconds]
01:53:56 <stassats`> could there be a problem with pthread_attr_setstack? "All pages within the stack described by stackaddr and stacksize shall be both readable and writable by the thread.", but we're putting the guards inside that space
01:54:43 <stassats`> i changed it to thread_control_stack_size - os_vm_page_size * 2, and it doesn't seem to be crashing anymore
01:57:07 <stassats`> i also have the binding guards disabled, need to check with them intact
02:07:45 <stassats`> it does indeed solve the problem, weird
02:24:54 <pkhuong> I guess it *could* be some TLS thing...
02:27:14 <stassats`> why would it be triggered by alloca?
02:28:36 <stassats`> strange, without the - os_vm_page_size * 2, /proc/maps shows the guard pages to be above [stack:45673], but with it the guard page is below
02:39:42 -!- christoph_debian [~christoph@ppp-188-174-146-202.dynamic.mnet-online.de] has quit [Ping timeout: 264 seconds]
02:46:05 <stassats`> non-working sb-dynamic-core prevents faster turn around
02:48:52 <stassats`> on x86, /proc/maps correctly identifies the relative location of page guards
02:52:14 pranavrc [~pranavrc@122.164.227.252] has joined #sbcl
02:52:14 -!- pranavrc [~pranavrc@122.164.227.252] has quit [Changing host]
02:52:14 pranavrc [~pranavrc@unaffiliated/pranavrc] has joined #sbcl
02:52:48 christoph_debian [~christoph@ppp-93-104-181-102.dynamic.mnet-online.de] has joined #sbcl
03:07:14 <stassats`> the assembly output of alloca(65536) doesn't suggest anything out of the ordinary
03:15:01 kanru [~kanru@118-163-10-190.HINET-IP.hinet.net] has joined #sbcl
03:35:16 -!- stassats` [~stassats@wikipedia/stassats] has quit [Ping timeout: 264 seconds]
04:43:21 Bike_ [~Glossina@wl-nat100.it.wsu.edu] has joined #sbcl
04:43:26 -!- Bike [~Glossina@wl-nat100.it.wsu.edu] has quit [Ping timeout: 264 seconds]
04:54:26 -!- Bike_ is now known as Bike
05:24:43 -!- pranavrc [~pranavrc@unaffiliated/pranavrc] has quit [Ping timeout: 264 seconds]
05:34:21 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
06:08:45 -!- easye [~user@213.33.70.157] has quit [Read error: No route to host]
07:42:24 echo-area [~user@123.120.251.36] has joined #sbcl
08:13:25 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 276 seconds]
08:25:29 -!- edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has quit [Quit: lifetime expired because experience stopped]
08:40:19 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
08:41:46 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
08:42:21 -!- jdz [~jdz@85.254.212.34] has quit [Quit: Byebye.]
08:46:06 jdz [~jdz@85.254.212.34] has joined #sbcl
08:48:46 -!- jdz [~jdz@85.254.212.34] has quit [Client Quit]
08:49:26 jdz [~jdz@85.254.212.34] has joined #sbcl
08:55:15 attila_lendvai [~attila_le@unaffiliated/attila-lendvai/x-3126965] has joined #sbcl
09:14:51 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
09:19:02 lurch_ [~lurch_@94-224-16-101.access.telenet.be] has joined #sbcl
11:08:31 yacks [~py@103.6.159.100] has joined #sbcl
11:12:51 -!- milosn_ [~milosn@user-5af507e0.broadband.tesco.net] has quit [Read error: Operation timed out]
11:18:29 milosn [~milosn@user-5af507e0.broadband.tesco.net] has joined #sbcl
11:42:25 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
11:46:07 -!- ASau [~user@p4FF96CCC.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
12:12:06 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 264 seconds]
12:19:55 yacks [~py@103.6.159.100] has joined #sbcl
12:31:35 segv- [~mb@95-91-243-229-dynip.superkabel.de] has joined #sbcl
12:48:11 -!- echo-area [~user@123.120.251.36] has quit [Read error: Connection reset by peer]
12:56:53 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 240 seconds]
13:06:10 yacks [~py@103.6.159.100] has joined #sbcl
13:16:21 -!- psilord2 [~psilord@c-69-180-173-249.hsd1.mn.comcast.net] has quit [Quit: Leaving.]
13:16:55 edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has joined #sbcl
13:39:52 nyef [~nyef@pool-70-109-145-126.cncdnh.east.myfairpoint.net] has joined #sbcl
14:00:01 <pkhuong> you know, with a page size of 64k, maybe we just end up allocating a tiny tiny stack.
14:10:47 -!- foom [jknight@nat/google/x-zxkmfprfmvinmljn] has quit [Ping timeout: 240 seconds]
14:11:13 foom [jknight@nat/google/x-vstwaceowsubvllm] has joined #sbcl
14:26:28 stassats` [~stassats@wikipedia/stassats] has joined #sbcl
14:31:07 <stassats`> pkhuong: it's 512K by default and even passing different values with --control-stack-size doesn't help
14:31:33 <stassats`> but low stack doesn't explain why setting pthread_attr_setstack two pages less is fixing this
14:32:53 <stassats`> another thing is that we have an unprotected whole after the guard, since it doesn't set a hard guard for threads, so it's <stack><guard-page><unprotected-page-for-the-hard-guard>
14:33:04 <stassats`> s/whole/hole/
14:34:40 <stassats`> and i'm not sure why we don't do that, the main thread stack has a hard guard
14:39:01 -!- yacks [~py@103.6.159.100] has quit [Ping timeout: 248 seconds]
14:40:10 -!- attila_lendvai [~attila_le@unaffiliated/attila-lendvai/x-3126965] has quit [Quit: Leaving.]
14:43:29 yacks [~py@103.6.158.105] has joined #sbcl
14:44:21 easye [~user@213.33.70.157] has joined #sbcl
14:55:50 <stassats`> setting a hard guard doesn't help, even more, some contribs fail to build
14:57:06 <stassats`> which is a bit odd
15:07:34 <stassats`> huh, setting stack: start: 0xf73c0000, end: 0xf7bc0000, size: 2097152 and then protecting page 0xf75a0000
15:08:33 <stassats`> control_stack_end seems to be completely wrong
15:10:07 <stassats`> and binding stack also ends up inside the control stack
15:13:44 <nyef> Are the thread stacks set up correctly for a stacks-grow-upward regime?
15:14:09 <stassats`> the direction appears to be correct
15:14:37 <nyef> Direction, starting position, and guard page location?
15:14:47 <nyef> Hrm.
15:15:03 <stassats`> guard page location is weird, it appears to be in the middle, see the above print-out
15:15:20 <stassats`> maybe the binding stack and alien stack are designed to be inside the control stack
15:15:27 <nyef> No, they're not.
15:15:58 <nyef> Well, the number stack is supposed to be the alien stack, but...
15:16:35 <stassats`> thread_control_stack_size is 2097152, but (- sb-vm:*control-stack-end* sb-vm:*control-stack-start*) is 524288
15:17:10 <nyef> n-fixnum-tag-bits?
15:19:00 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
15:19:00 <nyef> Yeah, that's the fact that the static symbols have aligned untagged values.
15:19:32 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
15:21:44 <stassats`> before setting the stack th->control_stack_start+thread_control_stack_size => 0xf7510000, but th->control_stack_end is 0xf6f10000
15:22:41 <stassats`> can (lispobj*)((void*)th->control_stack_start+thread_control_stack_size); be a problem?
15:22:58 <stassats`> that's basically how th->control_stack_end is calculated
15:23:09 <stassats`> in particular, the casts
15:26:22 <pkhuong> we do arithmetic on a void*?
15:27:13 <pkhuong> ah, gcc extension.
15:32:33 <stassats`> and lispobj is typedef unsigned int lispobj;
15:37:47 <loke_> pkhuong: that example isn't void* arithmetic though. -> binds tighter than the cast
15:44:27 <stassats`> (lispobj*)((void*)th->control_stack_start+thread_control_stack_size) gives 0xf6f50000, but th->control_stack_start+thread_control_stack_size gives 0xf7550000
15:44:44 <stassats`> so, something is indeed wrong with the casts
15:46:07 <loke_> weird. You sure?
15:46:35 <loke_> -> definitely binds tighter than typecast, and as such there should be no difference
15:46:48 <stassats`> loke_: http://paste.lisp.org/display/138422
15:46:55 <stassats`> i'm not sure about anything anymore
15:50:27 <loke_> w t f
15:51:18 <loke_> what I see in your output would make sense if the typecast bound tighter (since the GCC extension allows void* arithmetic with an assumed size of 1)
15:51:36 <loke_> oh wait a second
15:51:38 <stassats`> what does bind and tight mean?
15:51:49 <loke_> it does bind tighter than ->, but not tighter than +
15:52:08 <loke_> so it's equivalent to ((void*)th->control_stack_start)+thread_control_stack_size
15:52:27 <loke_> It means higher precedence
15:52:39 <stassats`> that's what i thought
15:52:49 <loke_> a+b*c, b and c are bound by the *, tighter than the a+b
15:53:03 <stassats`> i expected it to be ((void*)th->control_stack_start)
15:53:12 <loke_> stassats`: It's a bit ugly, and at the very least it should cast to char*, not void* unless you really _want_ to be GCC specific
15:53:26 <stassats`> control_stack_start is unsigned int *
15:53:53 <loke_> stassats`: right, but is _size in bytes?
15:54:24 <stassats`> size is in bytes, right
15:54:31 <stassats`> does it think it's in words?
15:54:41 <stassats`> and i don't want to be gcc specific, clang is a supported compiler too
15:59:42 <loke_> stassats`: if the code works the way it is, then just change the cast to char* instead of void*, and you'll be standards compliant.
16:00:08 <loke_> GCC just falls back to size=1 (i.e. char) if you try to do arithmetic on void*
16:00:13 <loke_> stupid, in my opinion. :-)
16:00:26 <stassats`> and that's not the only place where that happens
16:01:55 <stassats`> but that's compiled with gcc, why does it break then?
16:02:37 <pkhuong> I'm not sure that the snippet does anything wrong. control stack size is measure in bytes, right?
16:02:40 <pkhuong> *measured
16:03:23 <stassats`> yes
16:05:56 <stassats`> oh, i see, th->control_stack_start+thread_control_stack_size is actually what is wrong
16:06:01 <stassats`> without casts
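The scaling that tripped up the debug output can be reproduced in isolation. The sketch below is not SBCL code: the buffer and both helper names are made up for illustration, and only the `lispobj` typedef is taken from the log. It shows that adding a byte count to a `lispobj *` advances by that count times `sizeof(lispobj)`, while going through `char *` advances by exactly that many bytes. That matches the log: the two printed addresses differ by 0x600000, three times the 2097152-byte stack size, which is what a 4-byte `lispobj` would produce.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int lispobj;   /* as quoted in the log */

/* Hypothetical scratch area standing in for a thread's control stack. */
static lispobj demo_stack[64];

/* Bytes actually skipped when n is added to a lispobj* with no cast:
   the offset is scaled by sizeof(lispobj). */
static size_t bytes_advanced_without_cast(lispobj *start, size_t n)
{
    lispobj *end = start + n;
    return (size_t)((char *)end - (char *)start);
}

/* Bytes skipped when the pointer is first cast to char*: the offset is
   taken literally. This is the standard-conforming spelling of the
   GCC-only (void *) arithmetic discussed above. */
static size_t bytes_advanced_with_char_cast(lispobj *start, size_t n)
{
    assert(n % sizeof(lispobj) == 0);   /* keep the result aligned */
    lispobj *end = (lispobj *)((char *)start + n);
    return (size_t)((char *)end - (char *)start);
}
```

With the 64-element buffer, `bytes_advanced_without_cast(demo_stack, 16)` yields `16 * sizeof(lispobj)` and `bytes_advanced_with_char_cast(demo_stack, 16)` yields 16.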
16:08:08 <stassats`> but that's just debug output, so it does not explain why anything is failing
16:09:22 <stassats`> though (+ #xf6f50000 2097152) still gives #xF7150000, not 0xf6f50000
16:10:39 <pkhuong> I don't see how (+ #xf6f50000 2097152) could give 0xf6f50000
16:11:47 <stassats`> copied the wrong thing, #xf6d50000, and it's correct
16:11:51 <stassats`> so, back to square one
16:13:18 <stassats`> so, everything is correct, except that it starts to work when i pass thread_control_stack_size-os_vm_page_size*2 to pthread_attr_setstack instead of just thread_control_stack_size
16:20:54 <stassats`> and alloca(65536) should go near to the page guard
16:21:10 <stassats`> i'd opt in for bad pointer arithmetic instead of this conundrum
16:23:11 <stassats`> time to build a C program trying to replicate this
16:26:35 -!- ASau` is now known as ASau
16:34:14 -!- lurch_ [~lurch_@94-224-16-101.access.telenet.be] has quit [Quit: lurch_]
16:37:35 <pkhuong> stassats`: unlikely, but protecting both guard pages in init_new_threads might cause another, easier to understand, failure.
16:38:26 <stassats`> by both you mean hard and "soft"?
16:38:58 <stassats`> protecting the hard page did indeed cause sb-concurrency sb-introspect and sb-bsd-sockets to fail to build, but i didn't look any further than that
16:39:43 <stassats`> i'm building a C program which creates a thread with a custom stack and similar page guards, hopefully it can be easier to debug
16:52:02 -!- segv- [~mb@95-91-243-229-dynip.superkabel.de] has quit [Remote host closed the connection]
16:55:04 <stassats`> another interesting thing, we try to align stack space by page size, by rounding the start up
16:55:19 <stassats`> but, after the rounding, the size should grow down, which it doesn't
16:55:43 <stassats`> which can explain why subtracting os_vm_page_size*2 from  thread_control_stack_size helps
16:56:32 <pkhuong> I think there's a comment that says some struct's size is pre-padded.
16:57:04 <stassats`> yeah, even if it weren't, it allocates much more than just control stack size, so that wouldn't matter
16:57:27 pranavrc [~pranavrc@122.164.235.4] has joined #sbcl
16:57:28 -!- pranavrc [~pranavrc@122.164.235.4] has quit [Changing host]
16:57:28 pranavrc [~pranavrc@unaffiliated/pranavrc] has joined #sbcl
16:58:41 segv- [~mb@95-91-243-229-dynip.superkabel.de] has joined #sbcl
16:59:53 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:00:16 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:02:20 Odyessus [~odyessus@213.47.71.36] has joined #sbcl
17:03:38 <pkhuong> stassats`: also, c-stack isn't control stack on ppc.
17:05:21 <nyef> ... Right, number (alien) stack is control stack on PPC, isn't it?
17:07:01 <nyef> Err... is c-stack, not control stack. Oops.
17:09:56 <stassats`> what is c-stack?
17:10:26 <nyef> The control (and data) stack for C code.
17:10:45 <nyef> As opposed to the control (and boxed data) stack for Lisp code.
17:11:57 -!- Odyessus [~odyessus@213.47.71.36] has quit [Quit: Colloquy for iPad - http://colloquy.mobi]
17:14:18 <stassats`> an address on the stack is printed as 0xf762eb10, with the control stack being 0xf7430000-0xf7630000, and the page guard being at 0xf7610000, binding stack starting from 0xf7630000
17:14:21 <stassats`> something is weird
17:17:53 -!- luis [~luis@kerno.org] has quit [Quit: ZNC - http://znc.sourceforge.net]
17:18:43 <nyef> Address of what?
17:18:49 <nyef> And what's the number stack range?
17:19:36 <stassats`> 0xf762eb10 is printf("%p", alloca(10));
17:19:59 <stassats`> and number stack being?
17:20:22 <nyef> Is this for SBCL or a separate test program?
17:20:38 <stassats`> that's from a shared library loaded by sbcl
17:20:52 <nyef> Okay, so the number stack or alien stack range?
17:21:30 <stassats`> there's no such thing as number_stack_start, which stack would contain it?
17:23:22 <nyef> Might be alien_stack_start in struct thread.
17:23:43 <stassats`> alien_stack_guard_page is at 0xf7810000
17:24:00 <stassats`> (that's what i have printed right now)
17:25:19 <nyef> Oh, god.
17:26:04 <nyef> Is attach_os_thread() using pthread_attr_getstack() and setting the control stack to that value?
17:26:14 <stassats`> and binding guard is at 0xf74a0000, with binding stack starting from the end of the control stack
17:26:48 <nyef> And create_os_thread using pthread_attr_setstack() with the control_stack parameters?
17:27:25 <nyef> Mea culpa, mea culpa, maxima mea culpa, this is WRONG, WRONG, WRONG, and my fault for not seeing it when I did the original PPC thread conversion.
17:28:52 <stassats`> what does attach_os_thread do?
17:29:31 <nyef> Looks like it creates a struct thread for an existing thread that doesn't have a corresponding Lisp thread structure.
17:29:56 <nyef> Probably for call-into-lisp from a thread created by an external library.
17:30:40 <stassats`> that's what i thought
17:31:28 <stassats`> so, where do you say the problem lies?
17:32:00 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:32:06 <nyef> When using pthread_attr_getstack() and pthread_attr_setstack(), it should be control_stack on x86oids only, and alien_stack elsewhere.
17:32:43 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:33:00 <nyef> Actually, scratch that.
17:33:18 <nyef> The condition is control_stack on c_stack_is_control_stack systems, and alien_stack elsewhere.
17:33:45 <stassats`> ok, that's how i read it
17:33:48 <stassats`> let me try that
17:35:16 <nyef> Clearly, I need to desk-check the thread logic and PPC backend at some point, just to be sure. I can see HOW I missed this, but I can't see that it was particularly excusable given how much other thread damage there was in the runtime.
17:36:36 <stassats`> i also need to clear up things with memory barriers after allocations, just using sync may be too heavy-weight
17:36:57 <stassats`> (without sync it crashes and burns)
17:37:07 -!- slyrus [~chatzilla@107.200.11.156] has quit [Ping timeout: 264 seconds]
17:37:23 <nyef> Look again: PPC memory barriers basically ARE all sync.
17:37:31 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Read error: Connection reset by peer]
17:38:00 <stassats`> there are different options to sync, we just need that later writes are not seen earlier than the earlier writes
17:38:13 loke_ [~elias@bb115-66-249-26.singnet.com.sg] has joined #sbcl
17:38:52 <pkhuong> stassats`: after PA, regardless of allocation, actually.
17:39:19 slyrus [~chatzilla@107.200.11.156] has joined #sbcl
17:39:47 <stassats`> can that also happen outside of it? or should we just delegate that to the user?
17:40:02 <nyef> Probably should check to make sure there are barriers in the WITHOUT-GCING and WITHOUT-INTERRUPTS sequences as well.
17:40:52 <nyef> For system consistency, you probably need barriers whenever you do something "against GC rules", which is to say whenever you do something that has to be atomic with respect to GC.
17:41:19 <nyef> Outside of that, it's the user's responsibility to ensure consistency between threads in their own code.
17:41:30 <stassats`> makes sense
17:41:38 <pkhuong> (of course, the user might be ourselves)
17:44:08 <stassats`> ok, changed it to the alien stack, but alien_stack_guard_page => 0xf7680000, and the stack is 0xf769eb10
17:44:16 <stassats`> i.e., after the guard
17:44:30 <stassats`> is this a downward/upward problem?
17:44:58 <pkhuong> I scanned the runtime for that an hour ago, and nothing jumped out.
17:45:35 <stassats`> needless to say, getaddrinfo still doesn't work
17:46:03 -!- pranavrc [~pranavrc@unaffiliated/pranavrc] has quit [Ping timeout: 276 seconds]
17:46:14 <stassats`> but now i get INFO: Alien stack guard page unprotected, instead of the control stack
17:46:28 <nyef> Yeah, the alien stack should be growing upwards.
17:47:54 <stassats`> setting of th->alien_stack_pointer seems to be correctly protected by LISP_FEATURE_STACK_GROWS_DOWNWARD_NOT_UPWARD
17:48:43 <stassats`> hah, but the guard page is wrong
17:49:01 <stassats`> #define ALIEN_STACK_HARD_GUARD_PAGE(th)  (((os_vm_address_t)th->alien_stack_start) + ALIEN_STACK_SIZE - os_vm_page_size)
17:49:09 <stassats`> wait, it's
17:49:18 <stassats`> didn't notice the + ALIEN_STACK_SIZE part
17:53:14 pnpuff [~Op125@unaffiliated/pnpuff] has joined #sbcl
17:53:41 -!- pnpuff [~Op125@unaffiliated/pnpuff] has left #sbcl
17:54:28 <stassats`> and on the main thread, the stack address is not inside the alien stack, but on the default C stack
17:54:53 <pkhuong> I don't think I see where we switch stack in call_into_c.
17:55:32 <stassats`> so, it uses the one provided by pthread_attr_setstack and the default one?
17:57:51 <nyef> On PPC? We don't switch stack, Lisp uses a different stack pointer than C does.
17:58:07 <nyef> (Specifically, you should find that the Lisp reg_NSP is the C stack pointer.)
17:59:16 <pkhuong> ah, ok.
18:05:25 *stassats`* goes to sprinkle some more print statements
18:48:01 <stassats`> alien start: 0xf7370000, end: 0xf7470000, but /proc/maps says f7460000-f7720000 rwxp 00000000 00:00 0 [stack:48950]
18:49:17 <stassats`> and the address of a stack variable is 0xf746eae0, which confirms what /proc/maps says
18:55:23 <pkhuong> what does that snippet from /proc/maps hint at?
18:56:26 <stassats`> that the end and the start are mixed up
18:57:32 <stassats`> http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_attr_setstack.c;h=4bd314e66aefb6268b7757712eec87e0c9a334e3;hb=HEAD#l31 is a bit strange
18:57:46 rpg [~rpg@216.243.156.16.real-time.com] has joined #sbcl
18:58:04 <stassats`> it sets the address to stackaddr + stacksize, which is consistent with what is seen at /proc/maps, but not with what is needed
18:58:22 <stassats`> "stackaddr should point to the lowest addressable byte"
18:59:00 <stassats`> which may as well work on grows-downward
18:59:19 <pkhuong> no.
18:59:57 <pkhuong> I see what you mean.
19:01:39 <stassats`> but the math doesn't add up, it would start at 0xf7470000 if that were the case
19:02:25 <pkhuong> well. easy workaround: call setstack{addr,size} separately.
19:03:07 <stassats`> addr is deprecated
19:04:04 -!- loke_ [~elias@bb115-66-249-26.singnet.com.sg] has quit [Ping timeout: 264 seconds]
19:04:17 <pkhuong> yeah, well it doesn't do the fixup magic.
19:04:21 <stassats`> i find it hard to believe that glibc is doing something wrong, i must be missing something
19:04:47 <pkhuong> easy way to test that hypothesis: call setstackaddr directly.
19:05:34 <pkhuong> http://sourceware.org/git/?p=glibc.git;a=commit;f=nptl/pthread_attr_setstack.c;h=76a50749f7af5935ba3739e815aa6a16ae4440d1 says the fixup is intentional
19:10:00 <stassats`> so, pehrps /proc/maps is wrong
19:10:05 <stassats`> perhaps
19:13:40 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
19:14:11 <pkhuong> stassats`: I guess a simple test program would create a new thread and print sp.
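A test along those lines might look like the sketch below. It is not the program from the paste: `DEMO_STACK_SIZE`, `record_local_address`, and `probe_stack_offset` are names invented here, and the stack is obtained with `mmap` purely for convenience. It hands a caller-provided region to `pthread_attr_setstack`, lets the new thread record the address of one of its locals, and reports how far that address lies above the lowest byte of the region. An offset close to the region's size suggests the C stack grows downward from the high end.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std modes */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define DEMO_STACK_SIZE (2u * 1024 * 1024)   /* 2 MiB, an arbitrary choice */

/* Thread body: store the address of a local so the creator can see
   roughly where the stack pointer was. */
static void *record_local_address(void *out)
{
    volatile char local = 0;
    *(uintptr_t *)out = (uintptr_t)&local;
    return NULL;
}

/* Returns the offset of a thread-local variable from the lowest address
   of the caller-provided stack, or a negative value on failure. */
static long probe_stack_offset(void)
{
    void *base = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    pthread_attr_t attr;
    pthread_t tid;
    uintptr_t seen = 0;
    long offset = -1;

    if (pthread_attr_init(&attr) == 0) {
        /* stackaddr is the lowest addressable byte of the region. */
        if (pthread_attr_setstack(&attr, base, DEMO_STACK_SIZE) == 0 &&
            pthread_create(&tid, &attr, record_local_address, &seen) == 0 &&
            pthread_join(tid, NULL) == 0) {
            offset = (long)(seen - (uintptr_t)base);
        }
        pthread_attr_destroy(&attr);
    }
    munmap(base, DEMO_STACK_SIZE);
    return offset;
}
```

Printing the result of `probe_stack_offset()` next to `DEMO_STACK_SIZE` is enough to see which end of the region the thread is using. Depending on the toolchain, linking may require `-pthread`.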
19:20:49 loke_ [~elias@2001:470:36:b4a:589f:e2d2:cf5a:264f] has joined #sbcl
19:26:42 ASau [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
19:28:33 <stassats`> in my c program, it appears to be below the start of the stack
19:29:38 <stassats`> http://paste.lisp.org/display/138422#1
19:36:03 <pkhuong> stassats`: and what about setstackaddr?
19:37:14 <stassats`> blimey, i commented out &attr and put NULL instead
19:37:26 <stassats`> disregard that output
19:38:36 <stassats`> ok, with &attr i get stack start: 0x3fff876b0000, end: 0x3fff878b0000 stack = 0x3fff8788e710, and if i enable mprotect, it segfaults
19:38:42 <stassats`> just like in sbcl
19:43:02 <pkhuong> ok, and if you setstackaddr instead?
19:43:24 <stassats`> i get a segmentation fault without mprotect
19:43:53 <stassats`> it seems setstacksize doesn't have any effect in this case
19:44:30 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 264 seconds]
19:46:31 -!- edgar-rft [~GOD@HSI-KBW-149-172-63-75.hsi13.kabel-badenwuerttemberg.de] has quit [Quit: experience lost because computer sucks]
19:48:19 sdemarre [~serge@70.122-64-87.adsl-dyn.isp.belgacom.be] has joined #sbcl
19:50:21 ASau [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
19:53:08 <stassats`> and again, specifying THREAD_SIZE-page_size fixes that
19:53:51 <stassats`> what a peculiar situation, i have a fix, but have no idea why it works
19:54:56 luis` [~luis@kerno.org] has joined #sbcl
19:55:07 <stassats`> i mean, fixes that crash caused by mprotect
19:58:24 <nyef> And the problem is, if you don't know why it works, you don't know that it doesn't cause other problems down the line.
19:58:31 <stassats`> exactly
19:59:03 <stassats`> however, my example seems to crash during allocate_stack of pthreads
19:59:06 <stassats`> which does       /* The user provided stack memory needs to be cleared.  */ memset (pd, '\0', sizeof (struct pthread));
19:59:33 <stassats`> pd defined as pd = (struct pthread *) ((uintptr_t) attr->stackaddr - TLS_TCB_SIZE - adj);
19:59:39 <stassats`> so, it store something at the end of the stack
19:59:41 <stassats`> stores
20:00:05 <stassats`> maybe that's why protecting it breaks, and does not break on x86, because it protects the lower part
20:00:42 <stassats`> that would explain passing the lower thread size, but in my test program it happens directly in pthread_create, while sbcl crashes only after alloca
20:00:48 -!- ASau [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Remote host closed the connection]
20:03:16 <stassats`> this bug is fun
20:07:30 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
20:08:55 <stassats`> now i'm completely baffled, i've got stack growing downwards on ppc for me
20:13:14 <stassats`> http://www.ibm.com/developerworks/linux/library/l-powasm4/index.html says "stack grows downward"
20:13:21 <stassats`> whaaaa
20:17:31 <stassats`> glibc: /* On PPC the stack grows down.  */ #define _STACK_GROWS_DOWN	1
20:17:34 <stassats`> i demand a refund
20:17:48 <nyef> ... Okay, what does SB-ALIEN and number-stack do on PPC?
20:18:11 <nyef> SB-ALIEN would be src/compiler/target/ppc/c-call.lisp.
20:18:17 <nyef> Err...
20:18:24 <nyef> src/compiler/ppc/c-call.lisp.
20:18:30 <stassats`> do as in?
20:18:46 <nyef> There should be a VOP for allocating stack space, does it go up or down?
20:19:15 <nyef> There should also be a VOP somewhere used for allocating number stack space, possibly in ppc/call.lisp, does THAT go up or down?
20:20:43 <nyef> Less definitive but still strongly indicative, in ppc/insts.lisp, are TNs with a number-stack SB emitted with a positive or a negative offset, and from what register?
20:22:05 <pkhuong> I think machine code is really agnostic about this (calls work via the link register, right?)
20:22:19 <pkhuong> so we might grow upward in lisp when C grows downward :\
20:23:04 <stassats`> and the problem manifests itself when we set up the guard pages and the like with the idea that it grows upward
20:23:10 <Krystof> this is hilarious
20:23:13 <nyef> Yeah, but if SBCL is doing one thing with the number stack while C does another then we have a systemic problem, and plausibly have to define separate direction conditionals for each stack.
20:23:30 <Krystof> somewhere, Steve Jobs is laughing at us
20:24:15 <nyef> Krystof: Isn't it, though? And yet, it's still indicative of a failure in our development practice.
20:24:49 <Krystof> well now.  That's assuming we have a development practice :-)
20:25:03 <stassats`> he knew all along to abandon PPC
20:25:10 <Krystof> I would describe it more as a series of uncoordinated events
20:25:21 <pkhuong> yeah. stack grows upward on PPC/Lisp
20:25:41 <pkhuong> stassats`: actually, the problem manifests itself as C code randomly overwriting our number stack :|
20:26:45 <nyef> Mmm. C code screwing our number stack, and then number stack usage screwing C stack frames...
20:26:57 <pkhuong> LOADW is (lwz object base (- byte-offset lowtag)), and load-stack-tn is (loadw reg cfp-tn offset)
20:27:12 <stassats`> how come i'm the first one to notice that?
20:27:18 <nyef> And is LOAD-STACK-TN used for number-stack TNs?
20:27:46 <pkhuong> no. but it's a copy paste
20:28:01 <nyef> And is the NFP for a frame taken before or after adjusting the NSP?
20:28:11 <nyef> I'm thinking probably before, but I'm not certain.
20:28:22 <pkhuong> (not sure if the easiest fix is to make the number stack grow downward or make everything grow downward ;)
20:28:38 <pkhuong> wanna best stack grows upward on darwin or some such?
20:29:01 <Krystof> no bet
20:29:05 <nyef> I don't want to see the easiest fix, I want to see the fix that gives us the most flexibility to adapt to crazy stack regimes going forward.
20:29:12 <pkhuong> s/best/bet/
20:29:29 <Krystof> #!+stack-grows-both-upwards-and-downwards-and-maybe-sideways
20:29:32 <pkhuong> see, I don't want crazy stack regimes ;)
20:30:07 <nyef> #!+heap-allocated-number-stack-frames ?
20:31:05 <pkhuong> nfp grows down.
20:31:08 <pkhuong> oh dear.
20:31:15 <stassats`> i predicted that it would be something stupid, but on the third day i was convinced that it couldn't be, the fifth day eventually confirmed that it is indeed something stupid
20:31:48 <stassats`> always looking at it with the assumption that it grows upward made finding it quite difficult
20:31:52 <pkhuong> compute-old-nfp does (inst addi result nfp [number-stack-frame-size])
20:31:52 <nyef> NFP on PPC needs to grow down.
20:32:21 <pkhuong> yeah. ok. so it's actually correct, it's "just" our page protection logic that's messed up?
20:32:26 <nyef> Is it just that the guard pages are located stupidly, then?
20:33:57 <stassats`> do we use the alien stack to allocate things for with-alien?
20:34:45 <Krystof> stassats`: look on the bright side.  On the sixth day, it's fixed, and you can rest on the seventh day
20:35:10 *nyef* applauds.
20:35:13 <stassats`> the stack grows downward on os x ppc
20:35:59 <nyef> So, proposal: Rename CONTROL-STACK-GROWS-DOWNWARD-NOT-UPWARD to CONTROL-STACK-GROWS-DOWNWARD (one commit). Add ALIEN-STACK-GROWS-DOWNWARD, defined on PPC and x86oids (one commit). Use ALIEN-STACK-GROWS-DOWNWARD to correctly set the alien-stack guard pages (one commit).
20:36:54 <stassats`> and change pthread_create to use alien-stack on non-alien-stack-control-stacks
20:36:59 <nyef> And sorting out the issue with setting the C stack (alien stack) to overlap the control stack should be a separate commit as well, yes.
20:37:18 <nyef> Possibly two commits, one for attach and one for create.
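The idea of a per-stack direction conditional can be sketched as a small address calculation. This is an illustration under assumptions, not the runtime's code: `stack_direction` and `hard_guard_page` are hypothetical names, and the runtime selects the variant at compile time through feature macros rather than a run-time argument. The upward branch has the same shape as the `ALIEN_STACK_HARD_GUARD_PAGE` macro quoted earlier in the log; the downward branch puts the guard at the lowest page, which is where a downward-growing stack overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { STACK_GROWS_DOWNWARD, STACK_GROWS_UPWARD } stack_direction;

/* Address of the page to protect so that running off the end of the
   stack faults: the lowest page if the stack grows toward low
   addresses, the highest page otherwise. */
static uintptr_t hard_guard_page(uintptr_t start, size_t size,
                                 size_t page_size, stack_direction dir)
{
    assert(page_size != 0 && size >= page_size);
    if (dir == STACK_GROWS_DOWNWARD)
        return start;                    /* overflow heads toward start */
    return start + size - page_size;     /* overflow heads toward the end */
}
```

Plugging in the alien stack range printed earlier (start 0xf7370000, end 0xf7470000) with the 64k page size mentioned in the log gives 0xf7460000 for the upward rule and 0xf7370000 for the downward rule.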
20:37:50 <nyef> Does the initial thread go via attach, or is there a third place to worry about?
20:39:08 <stassats`> does alien stack grow downward only on x86oids and ppc?
20:39:25 <pkhuong> nyef: also, I think this means unboxed DX allocation is extra hard on PPC.
20:40:00 <nyef> pkhuong: I wouldn't be surprised... do we do unboxed DX allocation on PPC yet?
20:40:17 <nyef> If so, it needs to be desk-checked, and possibly test-cases written.
20:40:30 <pkhuong> I don't think so. stack grows downward, but we address TN by adding an offset from NFP, so tricks with NSP are looking pretty hairy.
20:40:54 <nyef> By offset from NFP, so tricks with NSP are looking pretty safe?
20:41:21 <pkhuong> the offset is in the wrong direction though, so we can't easily restore NSP.
20:41:40 <pkhuong> i.e. NFP isn't anything like ONSP.
20:42:03 <stassats`> on SPARC, stack grows downwards as well, at least that's how glibc defines it
20:44:01 -!- ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has quit [Ping timeout: 256 seconds]
20:46:25 <stassats`> and i guess because it grew downward, and lisp stack grew upward, it worked with both stacks being inside the control stack
20:46:53 <stassats`> fsvo "worked"
20:47:22 <pkhuong> nyef: but I think we're actually good there. We end up doing nsp <- nfp + frame_size.
20:54:48 ASau` [~user@p5797EA21.dip0.t-ipconnect.de] has joined #sbcl
20:54:53 <stassats`> what's the benefit of the control stack growing upward?
20:55:08 <stassats`> when the c stack grows the opposite direction
20:55:22 <pkhuong> marginally easier to think about.
20:56:42 <stassats`> too bad mips and sparc machines of the gcc compile farm are down currently
20:59:51 ehaliewicz [~user@50-0-51-11.dsl.static.sonic.net] has joined #sbcl
20:59:59 <stassats`> the alloca test now passes
21:00:41 <stassats`> and slime with a dedicated output stream (a.k.a. the addrinfo test)
21:00:52 <pkhuong> incredible.
21:01:09 <pkhuong> that and barrier, and ppc is almost usable?
21:01:15 <stassats`> right
21:03:26 <stassats`> frlock.1 test of sb-concurrency still fails intermittently
21:05:26 <pkhuong> how does it fail? timeout?
21:07:07 <stassats`> wrong expected values
21:07:38 <stassats`> looks like an ordering issue
21:09:35 <pkhuong> what value do we receive?
21:10:10 -!- rpg [~rpg@216.243.156.16.real-time.com] has quit [Quit: rpg]
21:11:13 <stassats`> r-e! is non-nil
21:11:47 <pkhuong> oh. I see.
21:11:53 <pkhuong> frlock are borked.
21:12:11 <stassats`> unless (eql c (+ a b)) fails
21:12:39 <pkhuong> frlock-read-begin should loop while a writer holds the write mutex.
21:14:03 <pkhuong> no. it does something with two counters instead.
21:17:48 <pkhuong> stassats`: in release-frlock-write-lock, can you move the :write barrier to the beginning of the function?
21:19:08 <stassats`> doesn't help
21:20:34 <pkhuong> and move the read barrier to the end of frlock-read-begin (+ m-v-prog1)
21:20:59 <pkhuong> erh, or just wrap the values form in barrier
21:22:28 <pkhuong> the pairing was all wrong.
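What "pairing" means here can be shown with a generic two-counter reader/writer scheme in C11 atomics. This is a sketch under assumptions and not the sb-concurrency frlock implementation: the type, the field names, and the helper functions are invented, a single writer is assumed (in the real thing a mutex serialises writers), and the payload keeps the invariant c == a + b that the failing test checks. The writer's release fence sits between the first counter bump and the data stores; the reader's acquire fence sits between the data loads and the final counter load. Each barrier on one side has a matching barrier on the other, which matters on weakly ordered machines such as PPC.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint pre;      /* bumped before a write begins */
    atomic_uint post;     /* set equal to pre after the write ends */
    atomic_int a, b, c;   /* payload, invariant: c == a + b */
} two_counter_lock;

static two_counter_lock demo_lock;   /* zero-initialised: pre == post == 0 */

/* Single writer assumed. */
static void write_pair(two_counter_lock *l, int a, int b)
{
    unsigned n = atomic_load_explicit(&l->pre, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->pre, n, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* pre before data */
    atomic_store_explicit(&l->a, a, memory_order_relaxed);
    atomic_store_explicit(&l->b, b, memory_order_relaxed);
    atomic_store_explicit(&l->c, a + b, memory_order_relaxed);
    atomic_store_explicit(&l->post, n, memory_order_release); /* data before post */
}

/* One read attempt; returns false if a write overlapped it. */
static bool try_read(two_counter_lock *l, int *a, int *b, int *c)
{
    unsigned seen_post = atomic_load_explicit(&l->post, memory_order_acquire);
    *a = atomic_load_explicit(&l->a, memory_order_relaxed);
    *b = atomic_load_explicit(&l->b, memory_order_relaxed);
    *c = atomic_load_explicit(&l->c, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* data before pre */
    unsigned seen_pre = atomic_load_explicit(&l->pre, memory_order_relaxed);
    return seen_pre == seen_post;
}

/* Retry until a consistent snapshot is obtained; return its c value. */
static int read_sum(two_counter_lock *l)
{
    int a, b, c;
    while (!try_read(l, &a, &b, &c)) { /* a writer was active, retry */ }
    return c;
}
```

After `write_pair(&demo_lock, 2, 3)`, a call to `read_sum(&demo_lock)` returns 5. Note the reader samples the counters in the opposite order from the writer's updates; moving a barrier to the wrong side of a counter access breaks that pairing.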
21:22:59 <stassats`> no failures in 10 runs so far
21:24:46 rpg [~rpg@c-71-63-247-69.hsd1.mn.comcast.net] has joined #sbcl
21:28:33 <stassats`> threads.impure still seems to be failing sometimes
21:28:38 <stassats`> even with the barrier
21:29:05 <pkhuong> any test in particular?
21:29:32 <stassats`> looks like the same subtypep one
21:30:01 <pkhuong> "subtypep one"?
21:30:23 <stassats`> calling subtypep from different threads, causing them to access its cache
21:30:30 <stassats`> the one i pasted earlier
21:32:59 <pkhuong> there is no way our cache code is thread safe.
21:33:32 <pkhuong> if it ever worked, that was an accident.
21:35:07 <pkhuong> ok. needs a barrier around ,@(sets) in early-extensions.lisp:L619
21:35:49 <stassats`> and i can't trigger it with my earlier test anymore
21:35:53 <pkhuong> it is all right on x86, obv.
21:36:08 <stassats`> and running threads.impure alone doesn't seem to trigger it
21:37:57 <pkhuong> and a read barrier around (,args-and-values (svref ,n-cache ,n-index)) as well
21:38:13 <pkhuong> ok. Now I'm almost tempted to resurrect alpha ;)
21:38:46 <stassats`> ppc is not enough of a challenge?
21:41:02 -!- sdemarre [~serge@70.122-64-87.adsl-dyn.isp.belgacom.be] has quit [Ping timeout: 264 seconds]
21:43:17 <pkhuong> and around (setq ,var-name (make-array ,size :initial-element 0)) (L617), if we ever stop 0-initialising the heap.
21:47:28 scymtym_ [~user@ip-5-147-122-209.unitymediagroup.de] has joined #sbcl
21:48:09 <pkhuong> (well, around make-array.)
21:51:36 <stassats`> now i got into ldb after :HASH-TABLE-PARALLEL-READERS
21:51:46 <stassats`> and before :hash-table-single-accessor-parallel-gc
21:55:03 <pkhuong> so, *in* :hash-table-parallel-readers ?
21:55:29 <stassats`> it manages to print "success running :Hash-Table-Parallel-Readers", but not "running :hash-table-single-accessor-parallel-gc"
21:55:34 <stassats`> so, somewhere in between
21:58:18 <pkhuong> what kind of ldb?
21:58:28 <stassats`> a non-working one
21:58:54 <pkhuong> sure, but what was the cause?
21:59:13 <stassats`> it seemed like two ldb prompts popped out, presumably from two threads, but i couldn't enter anything
21:59:24 <stassats`> so, no idea
22:03:02 <pkhuong> on one hand, I'm tempted to convert that code to joinable threads and a flag. on the other, it shouldn't crash this bad.
22:16:59 -!- psilord [~pkeller@23-25-144-217-static.hfc.comcastbusiness.net] has quit [Quit: Leaving.]
22:29:30 <stassats`> now it finished ok
22:32:42 <stassats`> maybe my printfs confused it somewhat, running again
22:49:43 -!- slyrus [~chatzilla@107.200.11.156] has quit [Ping timeout: 264 seconds]
22:53:42 slyrus [~chatzilla@107.200.11.156] has joined #sbcl
22:53:57 <stassats`> no failures this time either
22:54:04 <stassats`> ok, i'll blame printfs
23:01:26 -!- rpg [~rpg@c-71-63-247-69.hsd1.mn.comcast.net] has quit [Quit: rpg]
23:04:45 <stassats`> i think attach_os_thread doesn't work on (not c-stack-is-control-stack)
23:05:15 psilord [~psilord@c-69-180-173-249.hsd1.mn.comcast.net] has joined #sbcl
23:06:55 <stassats`> so, why not make c-stack-is-control-stack everywhere?
23:11:06 -!- segv- [~mb@95-91-243-229-dynip.superkabel.de] has quit [Remote host closed the connection]
23:13:01 <pkhuong> it'll mess with precise GC
23:24:15 <stassats`> then attach_os_thread should create a new control-stack?
23:24:49 <pkhuong> probably.
23:25:17 <pkhuong> but that's only used for foreign callbacks from foreign threads (and so only on sb-safepoint builds), right?
23:25:23 <stassats`> right
23:25:44 <stassats`> 3-for-3 successful complete ./run-tests.sh
23:27:24 <stassats`> i, of course, don't plan to run anything on ppc, so i'm not really worried about the lack of sb-safepoint, but i would care about it working on ARM
23:31:50 <pkhuong> even parallel subtypep?
23:32:19 <stassats`> even subtypep
23:32:57 <stassats`> the only difference between these runs and the failing ones is that i was printing stuff on pthread_create and on protect_page_guard
23:33:17 <stassats`> perhaps concurrent access to stdio was problematic
23:33:57 <pkhuong> subtypep is still wrong :|
23:34:49 <pkhuong> It could randomly return 0, 0.
23:35:25 <stassats`> i think i've seen that failure mode
23:39:16 <stassats`> but i can't manage to trigger it
23:40:23 <pkhuong> sometimes I wonder how hard it would be to adapt dthreads to sbcl ;)
23:40:54 <pkhuong> invariably, my conclusion is "hard."
23:41:27 <stassats`> not like getting the current threading to work right is easy
23:42:39 <pkhuong> it's not research, at least.
23:43:36 <stassats`> now that alien stack is guarded in the correct place, perhaps it'll make sense to enable hard page guards too?
23:44:00 <pkhuong> true.
23:44:50 <stassats`> previously, the space for hard page guards was the alien stack
23:45:28 <pkhuong> it was also the number stack.
23:46:04 <stassats`> but the number stack grew upward and with the soft page guard in the middle
23:46:22 <stassats`> perhaps that's why nobody noticed it, or haven't bothered reporting it
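[editor's note] The point about guard placement depending on growth direction can be illustrated with a small address calculation. This is a sketch under stated assumptions, not the SBCL runtime's layout: `guard_layout` and `place_guards` are hypothetical names, and it models only one hard and one soft guard page. The guards belong at the end of the region the stack grows toward, so a downward-growing stack guarded as if it grew upward leaves the overflow end unprotected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where the guard pages of one stack start. */
typedef struct {
    uintptr_t hard_guard;  /* last-resort page at the very end of the stack */
    uintptr_t soft_guard;  /* page hit first, while recovery is still possible */
} guard_layout;

/* The stack occupies [base, base + size); page is the OS page size. */
static guard_layout place_guards(uintptr_t base, size_t size, size_t page,
                                 bool grows_down)
{
    guard_layout g;
    if (grows_down) {
        /* Overflow runs toward low addresses: guards sit at the bottom. */
        g.hard_guard = base;
        g.soft_guard = base + page;
    } else {
        /* Overflow runs toward high addresses: guards sit at the top. */
        g.hard_guard = base + size - page;
        g.soft_guard = base + size - 2 * page;
    }
    return g;
}
```

With a 64K stack at 0x10000 and 4K pages, the downward case puts the soft guard at 0x11000 and the upward case puts it at 0x1E000. Those are opposite ends of the same region, which is the kind of mix-up described above.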
23:46:43 <pkhuong> probably no one had deeply recursive numeric functions
23:47:16 <pkhuong> corollary: 64K number stacks suffice.
23:47:17 <stassats`> or just nobody runs on ppc
23:47:57 <pkhuong> well, there's bluegene, but that's with an unthreaded build.
23:48:04 <stassats`> 64K? i think the number stack was larger, the c stack was 64K
23:48:24 <pkhuong> c stack is number stack.
23:48:58 <stassats`> why is it called "number"?
23:49:06 <pkhuong> that's where we put unboxed numbers
23:49:09 <stassats`> and why is it called so and none of the actual stacks are named so
23:50:08 <stassats`> so, binding is for special variables, control stack is function frames and alien stack is for c function frames?
23:50:24 <stassats`> except where control stack and alien stack are the same
23:50:34 <pkhuong> right.
23:50:53 <pkhuong> c function frames, or other stuff we don't want the GC to see.
23:52:40 <stassats`> and are these things put on the alien stack in the correct direction?
23:53:00 <pkhuong> well, on PPC at least, number stack frames are allocated in the right direction
23:53:15 <pkhuong> they're just addressed backward... but that's not *too* bad.
23:54:04 <stassats`> i need to conjure up some tests for this
23:54:11 <nyef> alien stack IS number stack.
23:54:40 <nyef> It's only called "alien stack" on x86oids, where numbers are stored on the control stack.
23:54:42 <stassats`> doing alloca(128K) should do for the current mix-up
23:55:14 <pkhuong> also, old school advantage of stack grows downward (potentially why that choice is so popular): you can brk upward to have both heap and stack without statically allocating anything.
23:55:14 <nyef> And then, to make things worse, alien code is run on the control stack anyway, so the alien stack is only used for stack-allocated alien data.
23:55:46 <pkhuong> nyef: I'm surprised it's used at all on x86
23:55:55 <nyef> A plausible choice, but brk()en on 32 bit systems, really.
23:55:57 <stassats`> nyef: that got me quite confused when i looked at it
23:56:31 <nyef> ARM, IIRC, has alien-stack rather than number-stack, but a precisely-scavenged Lisp control stack.
23:56:42 <nyef> On the other paw, ARM doesn't yet get to the point of GC.
23:58:05 <stassats`> no wonder it was mixed up in the first place
23:58:28 <nyef> Pretty much, yeah.
23:59:18 <nyef> Hence why it was so easy to miss when I did the original PPC threads bit, there were so many OBVIOUS breakages when it came to threads in the runtime that something subtle like the stacks being completely screwed up slipped entirely under the radar.