00:07:02 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_] 00:09:45 Fare [fare@nat/google/x-yobucpodnqfvlyxa] has joined #ccl 01:05:03 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl 01:15:48 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_] 02:19:57 -!- milanj [~milanj_@178-223-170-236.dynamic.isp.telekom.rs] has quit [Ping timeout: 244 seconds] 02:24:53 milanj [~milanj_@178-223-141-91.dynamic.isp.telekom.rs] has joined #ccl 02:57:49 -!- DataLinkDroid [~DataLink@1.132.156.200] has quit [Ping timeout: 260 seconds] 03:00:03 DataLinkDroid [~DataLink@1.132.156.200] has joined #ccl 03:05:23 bfulgham_ [~brent@cpe-76-173-170-144.socal.res.rr.com] has joined #ccl 03:15:16 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl 03:24:14 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_] 03:30:06 -!- DataLinkDroid [~DataLink@1.132.156.200] has quit [Quit: Bye] 04:06:17 -!- fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has quit [Remote host closed the connection] 04:06:27 fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has joined #ccl 04:22:58 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.] 04:24:33 sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has joined #ccl 04:37:59 -!- Fare is now known as Fare_afk 04:40:54 -!- bfulgham_ [~brent@cpe-76-173-170-144.socal.res.rr.com] has quit [Quit: bfulgham_] 05:20:04 -!- milanj [~milanj_@178-223-141-91.dynamic.isp.telekom.rs] has quit [Quit: Leaving] 07:15:51 -!- Fare_afk is now known as Fare 07:31:55 ZabaQ [~jconnors@85.207.11.34] has joined #ccl 07:32:04 I've been trying to find the ccl changelog. 07:32:25 ccl has stopped building with the tdm version of mingw..I'm trying to pinpoint the breaking change. 07:32:57 Actually, I'm an idiot, I've got Tortoise on this machine.. 07:34:08 Hmm. The log messages are mostly "update kernel/binaries.." 08:05:56 # This makefile is written to be used with the mingw toolchain 08:05:56 # running in the Cygwin environment. 08:05:58 !! 08:06:08 I've never used Cygwin to compile it..msys was fine.. 08:06:20 ..but it doesn't look like ccl has changed. 08:06:27 So, mingw probably has. 08:16:59 *ZabaQ* is now totally confused as to how he ever managed to build ccl 08:18:28 -!- kanru_ [~kanru@kanru-1-pt.tunnel.tserv15.lax1.ipv6.he.net] has quit [Quit: ZNC - http://znc.sourceforge.net] 08:21:29 kanru_ [~kanru@kanru-1-pt.tunnel.tserv15.lax1.ipv6.he.net] has joined #ccl 08:29:08 Well, it looks like tdm mingw64 and cygwin mingw 64 have exactly the same flags, so if I just change the executable name it should just work. 09:33:16 -!- billstclair [~billstcla@unaffiliated/billstclair] has quit [Ping timeout: 252 seconds] 09:33:40 -!- Quadrescence [~quad@unaffiliated/quadrescence] has quit [Ping timeout: 248 seconds] 09:34:24 Quadrescence [~quad@unaffiliated/quadrescence] has joined #ccl 11:05:46 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.] 11:38:36 billstclair [~billstcla@p-216-227-82-151.dsl1.rtr.chat.fpma.frpt.net] has joined #ccl 11:38:36 -!- billstclair [~billstcla@p-216-227-82-151.dsl1.rtr.chat.fpma.frpt.net] has quit [Changing host] 11:38:36 billstclair [~billstcla@unaffiliated/billstclair] has joined #ccl 13:09:11 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl 13:11:38 alms__ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl 13:13:19 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout: 246 seconds] 13:13:32 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl 13:14:43 -!- alms_ [alms@clozure-E1F692DA.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout] 13:16:07 -!- alms__ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout: 240 seconds] 14:00:38 ..no such luck :( 14:02:52 ZabaQ: I wish the Windows toolchains would stop changing out from underneath us. I'm not even sure if those instructions at the top of the Makefile still work. 14:06:16 The instructions at http://trac.clozure.com/ccl/wiki/WindowsNotes worked at one time... 14:09:13 There are several different mingw distributions floating around, too :( 14:12:29 The checked-in lisp kernel binaries should generally be fairly up-to-date, so you can often get away without rebuilding the lisp kernel, and just rebuild the lisp heap image with (rebuild-ccl :clean t) 14:13:59 If you want to create a Trac ticket about this and include version info, build output, etc., I can try to take a look. 14:27:21 if I don't rebuild the kernel I get Error: Errors (:ERROR 2) reloading boot image: reloading the inital boot image 14:27:45 if I do rebuild the kernel I get Error: Errors (:ERROR 3) and a stack dump. 14:28:03 Is there any way I can get more info about what these errors represent? 14:29:41 I think they're just values from errno.h; so 2 would be ENOENT, and 3 would be ESRCH. 14:30:07 Is CCL_DEFAULT_DIRECTORY set in your environment? You might try unsetting it if so. 14:52:31 no, nice try. 14:58:17 Or in other words, does (probe-file "ccl:") look right? Any junk in your init file? 14:58:30 *rme* guesses wildly 15:02:13 Hmm. Might be picking up stuff in init file.. 15:02:26 ..don't worry about it until I've had more chance to investigate it. 15:13:36 Nah, not the init file. 15:15:40 http://paste.lisp.org/display/130382 15:22:42 Annotated with backtrace http://paste.lisp.org/display/130382#1 15:23:11 sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has joined #ccl 15:24:38 That's with the lisp kernel you built, yes? I assume it works to do (rebuild-ccl :clean t) from a fresh checkout. 15:25:30 I'll try that: but last time I tried, I got ERROR 2 15:30:54 bfulgham_ [~brent@76.173.170.144] has joined #ccl 15:32:33 ..this happens.. 15:32:35 http://paste.lisp.org/display/130382#2 15:33:31 I'm using oldwx86cl.exe -n -I oldwx86cl.image to launch the image.. 15:33:48 Is this is the trunk, or the 1.8 release branch? 15:34:31 1.8 release branch. 15:34:46 Should I try trunk? 15:35:56 You can certainly try the trunk, but if the 1.8 release branch can't rebuild itself, that's not so good. 15:36:12 gcc --version reports gcc (tdm64-1) 4.6.1 15:44:42 I get the same result with the 64 bits version 15:51:25 -!- rme [rme@7A1D4CD0.F26E1912.699BA7A6.IP] has quit [Ping timeout] 15:52:34 -!- rme [~rme@50.43.156.12] has quit [Ping timeout: 246 seconds] 15:54:31 rme [~rme@50.43.135.11] has joined #ccl 15:57:04 rme_ [~rme@50.43.144.163] has joined #ccl 15:57:22 -!- rme [rme@6A97116A.80B03224.699BA7A6.IP] has quit [Ping timeout] 15:57:23 -!- rme_ is now known as rme 15:58:44 -!- rme [~rme@50.43.135.11] has quit [Ping timeout: 248 seconds] 15:58:45 -!- rme_ is now known as rme 16:06:23 ZabaQ: From a fresh checkout of 1.8, starting up with wx86cl.exe -n and then doing (rebuild-ccl :clean t) works for me. 16:07:32 what is your gcc --version? 16:08:58 the "error: error(2)" looks like it's coming from RUN-PROGRAM. REBUILD-CCL will try to run wx86cl.exe; does that file exist, or was it renamed ? 16:12:07 -!- Fare [fare@nat/google/x-yobucpodnqfvlyxa] has quit [Ping timeout: 240 seconds] 16:13:06 rme_ [~rme@50.43.147.141] has joined #ccl 16:13:28 -!- rme [~rme@50.43.144.163] has quit [Read error: Operation timed out] 16:13:28 -!- rme_ is now known as rme 16:13:44 -!- rme [rme@21589A60.B02E95DC.699BA7A6.IP] has quit [Ping timeout] 16:16:08 No, it does not exist. 16:17:25 I renamed wx86cl.exe and wx86cl.image before building as I understood this was required on windows due to not being able to write to a running image. 16:21:06 aha 16:21:14 d'oh 16:21:23 REBUILD-CCL :FULL T would try to write to the kernel; :CLEAN T avoids that. 16:22:03 so, yes, I tried it with the kernel in place with the old name. no problemo. 16:22:36 It's confusing. 16:31:00 indeed 16:36:54 milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has joined #ccl 16:47:14 rme_ [~rme@50.43.133.104] has joined #ccl 16:47:52 -!- rme [rme@C3BA3E18.5174A474.699BA7A6.IP] has quit [Ping timeout] 16:47:53 -!- rme_ is now known as rme 16:48:56 Fare [fare@nat/google/x-pswsyyfwwljfcgiy] has joined #ccl 16:49:24 -!- rme [~rme@50.43.147.141] has quit [Ping timeout: 248 seconds] 16:49:58 -!- Fare [fare@nat/google/x-pswsyyfwwljfcgiy] has quit [Client Quit] 16:50:04 rme [~rme@50.43.138.72] has joined #ccl 16:51:04 -!- rme [rme@BFABE2C8.B66D9153.699BA7A6.IP] has quit [Ping timeout] 16:51:05 -!- rme_ is now known as rme 16:51:23 -!- rme_ [~rme@50.43.133.104] has quit [Ping timeout: 245 seconds] 17:02:00 -!- fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has quit [Remote host closed the connection] 17:02:09 fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has joined #ccl 17:08:06 maxm- [~user@unaffiliated/maxm] has joined #ccl 17:08:11 -!- maxm [~user@unaffiliated/maxm] has quit [Read error: Connection reset by peer] 17:12:21 -!- bfulgham_ [~brent@76.173.170.144] has quit [Quit: bfulgham_] 17:13:31 -!- maxm- is now known as maxm 17:20:51 -!- rme [rme@8BF4C302.BBBE3EE8.699BA7A6.IP] has quit [Ping timeout] 17:22:22 -!- rme [~rme@50.43.138.72] has quit [Ping timeout: 255 seconds] 17:26:41 rme_ [~rme@50.43.162.141] has joined #ccl 17:28:58 -!- milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has quit [Ping timeout: 251 seconds] 17:55:34 Thanks for the help,. 17:56:03 -!- ZabaQ [~jconnors@85.207.11.34] has quit [Quit: Leaving.] 18:09:18 milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has joined #ccl 18:15:04 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.] 18:29:14 Fare [fare@nat/google/x-wcwnsingfzoyyhnx] has joined #ccl 19:31:51 hi. When a ccl is caught in a deadlock, how do I get from the address to the lisp lock object and/or name of the variable that holds the lock object? 19:32:08 (say I can see the address from strace and can attach a gdb) 19:45:37 If you can get a core dump, there is stuff in CCL that understands the structure of the lisp heap from a core dump, so you could search for a symbol whose value cell contains whatever... 19:45:58 the process is not hosed, but one of its thread is busy polling or some such 19:46:16 You can get a core dump of a live process, no? 19:46:29 without killing it? 19:46:47 I don't know, I always thought you could, but I guess I haven't actually ever done so... 19:47:58 I suspect one process to be busy polling for a lock, for there's incessant lock activity 19:49:32 http://www.gsp.com/cgi-bin/man.cgi?section=1&topic=gcore 19:50:33 Fare: You think you have the address of a lisp lock object in gdb? 19:50:36 -!- rme_ is now known as rme 19:50:47 rme: yes 19:50:50 strace gave it to me 19:51:30 (it's in production, so no way the image is getting out of a restricted access area) 19:52:05 Then x/1g $addr - 13 will print out the uvector header for that object. If it's a lock object, the low byte will be 118. 19:52:18 (aka x8664::subtag-lock) 19:52:31 Do you have a working repl? 19:53:43 grep -v 'futex\|nanosleep' strace.log | wc -l 19:53:43 10359 19:53:44 wc -l < strace.log 19:53:44 15806145 19:53:44 I say the problem is with a lock 19:54:09 no repl, but can attach gdb 19:54:13 can I go from gdb to repl? 19:54:30 (not responding to requests, so can't use one to start a slime) 19:55:41 at that point, I'm putting my money on a race condition in our application, possibly related to logging, but probably not since there's still some logging happening. 20:00:41 21120 futex(0x5b438f0, 0x189 /* FUTEX_??? */, 0, {1341343718, 175128}, ffffffff) = -1 ETIMEDOUT (Connection timed out) 20:02:02 x/10g 0x5b438e0 20:02:03 0x5b438e0: 2 49 20:02:03 0x5b438f0: 549755813888 1 20:02:03 0x5b43900: 8589934593 97293968 20:02:03 0x5b43910: 48 657 20:02:03 0x5b43920: -4294967283 4294967295 20:28:47 that doesn 20:30:46 that doesn't look like a futex syscall; 189 doesn't look like a constant defined in /usr/include/linux/futex.h and doesn't look like a plausible LOGIOR of some constants. 20:33:15 we established that it was the thread in a signal handler waiting for tcr->resume 20:35:27 hex. It's FUTEX_CLOCK_REALTIME|FUTEX_PRIVATE_FLAG|FUTEX_TRYLOCK_PI|FUTEX_WAKE 20:36:05 Like I said, not a plausible combination of flags. 20:39:19 :-/ 20:39:25 where do we go from there? 20:46:44 Is that the only thread waiting to be resumed ? 20:48:44 looks like so 20:49:01 and it's eat a lot of cpu 20:50:08 if it's really waiting for tcr->resume, it shouldn't be busy-waiting. 20:51:03 then another thread must be busy-waiting -- how do i tell/ 20:52:29 it might be that some corruption caused this futex to be called with the wrong arguments, thereby hosing it. 20:52:39 then we'd have to find the source of the corruption 21:09:37 When the GC runs in some arbitrary thread, it sends signals to all other threads; they enter a signal handler and wait for their "resume" semaphores (tcr->resume) to be signaled, which should happen when the GC is finished. For some unknown reason, this thread seems to still be waiting to be told to resume. 21:23:37 is this semaphore usually used by anything but the gc? 21:26:25 bzzbzz [~franco@modemcable151.155-57-74.mc.videotron.ca] has joined #ccl 21:26:28 The futex syscall may be a red herring. If the thread's waiting for tcr->resume, it should have called sem_wait_forever(), which in 1.8 (IIRC) does a repeated sem_timedwait() with a timeout of 1 second. If so, you might be able to get the sem_timedwait() to return 0 (by setting $rax to 0 after the syscall) and that should cause the thread to wake up. 21:27:50 It's rarely used by anything else. There's a way to suspend other threads from lisp (that uses the same mechanism) but it's dangerous and I don't think it's used in CCL anywhere. 21:35:06 cat task/*/stat | cut -d' ' -f1,14,15,16,17,18 says that thread is taking most of the cpu. 21:35:10 I lean towards dumping a core, saving it for analysis, and restarting a new process 21:39:13 though I feel queasy about being unable to diagnose what went wrong. 21:39:36 is there any diagnostic we can run while it's alive that we won't be able to from a core? 21:42:03 what is the safe way to trigger a core dump? 21:42:07 one thing that may or may not be helpful is to see what other threads are doing lisp-wise, but what I'm thinking of would try to write to the standard error stream. Can you see that output ? 21:43:22 I think that you can detach GDB and send the process some signal that's defined to dump core. 21:43:23 it's a pipe, probably redirected to one or our log files 21:44:06 There's a gcore command that's supposed to generate a core for a running process. 21:46:28 gbyers: can we do the lisp introspection somehow? 21:46:53 after i dump core, it might be too late 21:49:11 what I was thinking of was selecting each other thread in GDB - by "(gdb) thread N" for n=1.. and in each thread doing "(gdb) call plbt_sp($rbp)". That'll try to do a lisp backtrace for each thread. That may or may not be interesting. 21:50:30 ok 22:05:01 looking at where the daemonizer logs the stderr, I see backtraces that mention the SIGQUIT-handler 22:05:50 I never remember what dumps core and what doesn't. 22:08:31 oh, that was a different process 22:10:35 in thread 1, 1 Thread 0x7f2cccb29880 (LWP 17403) 0x00007f2ccc28786d in nanosleep () 22:10:35 at ../sysdeps/unix/syscall-template.S:82 22:10:49 (gdb) call plbt_sp($rbp) 22:10:49 [New Thread 0x7f2cb7fff700 (LWP 12267)] 22:10:50 [Thread 0x7f2cbe86a700 (LWP 32343) exited] 22:11:30 (gdb) thread 2 22:11:30 [Switching to thread 2 (Thread 0x7f2cbf91c700 (LWP 21164))]#0 0x00007f2cc97b7994 in ttcpip () 22:11:30 from /usr/lib/libclntsh.so.11.1 22:11:30 (gdb) call plbt_sp($rbp) 22:11:30 [Thread 0x7f2cb730f700 (LWP 32344) exited] 22:11:31 [Thread 0x7f2cc02e0700 (LWP 17867) exited] 22:11:33 [Thread 0x7f2cbef58700 (LWP 17877) exited] 22:11:35 [Thread 0x7f2cbdea6700 (LWP 21117) exited] 22:11:37 [Thread 0x7f2cbd916700 (LWP 21120) exited] 22:11:39 [Thread 0x7f2cbd386700 (LWP 21121) exited] 22:11:41 [Thread 0x7f2cbf91c700 (LWP 21164) exited] 22:11:43 ^C 22:11:45 oops 22:11:56 I'm stuck :-( 22:13:24 -!- milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has quit [Quit: Leaving] 22:13:29 If you just did ^C, you should be back in GDB. You may see some lisp backtraces in whatever log file. 22:13:53 oh well, I killed it wrong 22:13:53 I got gdb stuck trying to inspect the wrong thread (from oracle) 22:14:06 then trying to kill gdb itself, i killed the lisp process :-( 22:14:56 sigh. 22:15:04 well, thanks a lot for your help 22:15:05 Oh well. Has this happened before to your knowledge ? 22:15:21 not to my knowledge -- not this kind of getting hosed. 22:15:35 Good news and bad. 22:15:51 sorry I bungled it 22:16:33 I guess "if it happens again, let us know". I don't have a good theory. 22:51:29 -!- Fare [fare@nat/google/x-wcwnsingfzoyyhnx] has quit [Ping timeout: 265 seconds] 22:58:24 the CPU-hogging futex loop was caused by the leap second bug in the linux kernel 23:51:22 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_]