00:07:02 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_]
00:09:45 Fare [fare@nat/google/x-yobucpodnqfvlyxa] has joined #ccl
01:05:03 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl
01:15:48 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_]
02:19:57 -!- milanj [~milanj_@178-223-170-236.dynamic.isp.telekom.rs] has quit [Ping timeout: 244 seconds]
02:24:53 milanj [~milanj_@178-223-141-91.dynamic.isp.telekom.rs] has joined #ccl
02:57:49 -!- DataLinkDroid [~DataLink@1.132.156.200] has quit [Ping timeout: 260 seconds]
03:00:03 DataLinkDroid [~DataLink@1.132.156.200] has joined #ccl
03:05:23 bfulgham_ [~brent@cpe-76-173-170-144.socal.res.rr.com] has joined #ccl
03:15:16 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl
03:24:14 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_]
03:30:06 -!- DataLinkDroid [~DataLink@1.132.156.200] has quit [Quit: Bye]
04:06:17 -!- fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has quit [Remote host closed the connection]
04:06:27 fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has joined #ccl
04:22:58 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.]
04:24:33 sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has joined #ccl
04:37:59 -!- Fare is now known as Fare_afk
04:40:54 -!- bfulgham_ [~brent@cpe-76-173-170-144.socal.res.rr.com] has quit [Quit: bfulgham_]
05:20:04 -!- milanj [~milanj_@178-223-141-91.dynamic.isp.telekom.rs] has quit [Quit: Leaving]
07:15:51 -!- Fare_afk is now known as Fare
07:31:55 ZabaQ [~jconnors@85.207.11.34] has joined #ccl
07:32:04 <ZabaQ> I've been trying to find the ccl changelog.
07:32:25 <ZabaQ> ccl has stopped building with the tdm version of mingw..I'm trying to pinpoint the breaking change.
07:32:57 <ZabaQ> Actually, I'm an idiot, I've got Tortoise on this machine..
07:34:08 <ZabaQ> Hmm. The log messages are mostly "update kernel/binaries.."
08:05:56 <ZabaQ> # This makefile is written to be used with the mingw toolchain
08:05:56 <ZabaQ> # running in the Cygwin environment.
08:05:58 <ZabaQ> !!
08:06:08 <ZabaQ> I've never used Cygwin to compile it..msys was fine..
08:06:20 <ZabaQ> ..but it doesn't look like ccl has changed.
08:06:27 <ZabaQ> So, mingw probably has.
08:16:59 *ZabaQ* is now totally confused as to how he ever managed to build ccl
08:18:28 -!- kanru_ [~kanru@kanru-1-pt.tunnel.tserv15.lax1.ipv6.he.net] has quit [Quit: ZNC - http://znc.sourceforge.net]
08:21:29 kanru_ [~kanru@kanru-1-pt.tunnel.tserv15.lax1.ipv6.he.net] has joined #ccl
08:29:08 <ZabaQ> Well, it looks like tdm mingw64 and cygwin mingw 64 have exactly the same flags, so if I just change the executable name it should just work.
09:33:16 -!- billstclair [~billstcla@unaffiliated/billstclair] has quit [Ping timeout: 252 seconds]
09:33:40 -!- Quadrescence [~quad@unaffiliated/quadrescence] has quit [Ping timeout: 248 seconds]
09:34:24 Quadrescence [~quad@unaffiliated/quadrescence] has joined #ccl
11:05:46 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.]
11:38:36 billstclair [~billstcla@p-216-227-82-151.dsl1.rtr.chat.fpma.frpt.net] has joined #ccl
11:38:36 -!- billstclair [~billstcla@p-216-227-82-151.dsl1.rtr.chat.fpma.frpt.net] has quit [Changing host]
11:38:36 billstclair [~billstcla@unaffiliated/billstclair] has joined #ccl
13:09:11 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl
13:11:38 alms__ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl
13:13:19 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout: 246 seconds]
13:13:32 alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has joined #ccl
13:14:43 -!- alms_ [alms@clozure-E1F692DA.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout]
13:16:07 -!- alms__ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Ping timeout: 240 seconds]
14:00:38 <ZabaQ> ..no such luck :(
14:02:52 <rme> ZabaQ: I wish the Windows toolchains would stop changing out from underneath us.  I'm not even sure if those instructions at the top of the Makefile still work.
14:06:16 <rme> The instructions at http://trac.clozure.com/ccl/wiki/WindowsNotes worked at one time...
14:09:13 <ZabaQ> There are several different mingw distributions floating around, too :(
14:12:29 <rme> The checked-in lisp kernel binaries should generally be fairly up-to-date, so you can often get away without rebuilding the lisp kernel, and just rebuild the lisp heap image with (rebuild-ccl :clean t)
14:13:59 <rme> If you want to create a Trac ticket about this and include version info, build output, etc., I can try to take a look.
14:27:21 <ZabaQ> if I don't rebuild the kernel I get  Error: Errors (:ERROR 2) reloading boot image: reloading the inital boot image
14:27:45 <ZabaQ> if I do rebuild the kernel I get Error: Errors (:ERROR 3) and a stack dump.
14:28:03 <ZabaQ> Is there any way I can get more info about what these errors represent?
14:29:41 <rme> I think they're just values from errno.h;  so 2 would be ENOENT, and 3 would be ESRCH.
14:30:07 <rme> Is CCL_DEFAULT_DIRECTORY set in your environment?  You might try unsetting it if so.
14:52:31 <ZabaQ> no, nice try.
14:58:17 <rme> Or in other words, does (probe-file "ccl:") look right?  Any junk in your init file?
14:58:30 *rme* guesses wildly
15:02:13 <ZabaQ> Hmm. Might be picking up stuff in init file..
15:02:26 <ZabaQ> ..don't worry about it until I've had more chance to investigate it.
15:13:36 <ZabaQ> Nah, not the init file.
15:15:40 <ZabaQ> http://paste.lisp.org/display/130382
15:22:42 <ZabaQ> Annotated with backtrace http://paste.lisp.org/display/130382#1
15:23:11 sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has joined #ccl
15:24:38 <rme> That's with the lisp kernel you built, yes?  I assume it works to do (rebuild-ccl :clean t) from a fresh checkout.
15:25:30 <ZabaQ> I'll try that: but last time I tried, I got ERROR 2
15:30:54 bfulgham_ [~brent@76.173.170.144] has joined #ccl
15:32:33 <ZabaQ> ..this happens..
15:32:35 <ZabaQ> http://paste.lisp.org/display/130382#2
15:33:31 <ZabaQ> I'm using oldwx86cl.exe -n -I oldwx86cl.image to launch the image..
15:33:48 <rme> Is this is the trunk, or the 1.8 release branch?
15:34:31 <ZabaQ> 1.8 release branch.
15:34:46 <ZabaQ> Should I try trunk?
15:35:56 <rme> You can certainly try the trunk, but if the 1.8 release branch can't rebuild itself, that's not so good.
15:36:12 <ZabaQ> gcc --version reports gcc (tdm64-1) 4.6.1
15:44:42 <ZabaQ> I get the same result with the 64 bits version
15:51:25 -!- rme [rme@7A1D4CD0.F26E1912.699BA7A6.IP] has quit [Ping timeout]
15:52:34 -!- rme [~rme@50.43.156.12] has quit [Ping timeout: 246 seconds]
15:54:31 rme [~rme@50.43.135.11] has joined #ccl
15:57:04 rme_ [~rme@50.43.144.163] has joined #ccl
15:57:22 -!- rme [rme@6A97116A.80B03224.699BA7A6.IP] has quit [Ping timeout]
15:57:23 -!- rme_ is now known as rme
15:58:44 -!- rme [~rme@50.43.135.11] has quit [Ping timeout: 248 seconds]
15:58:45 -!- rme_ is now known as rme
16:06:23 <rme> ZabaQ: From a fresh checkout of 1.8, starting up with wx86cl.exe -n and then doing (rebuild-ccl :clean t) works for me.
16:07:32 <ZabaQ> what is your gcc --version?
16:08:58 <gbyers> the "error: error(2)" looks like it's coming from RUN-PROGRAM.  REBUILD-CCL will try to run wx86cl.exe; does that file exist, or was it renamed ?
16:12:07 -!- Fare [fare@nat/google/x-yobucpodnqfvlyxa] has quit [Ping timeout: 240 seconds]
16:13:06 rme_ [~rme@50.43.147.141] has joined #ccl
16:13:28 -!- rme [~rme@50.43.144.163] has quit [Read error: Operation timed out]
16:13:28 -!- rme_ is now known as rme
16:13:44 -!- rme [rme@21589A60.B02E95DC.699BA7A6.IP] has quit [Ping timeout]
16:16:08 <ZabaQ> No, it does not exist.
16:17:25 <ZabaQ> I renamed wx86cl.exe and wx86cl.image before building as I understood this was required on windows due to not being able to write to a running image.
16:21:06 <ZabaQ> aha
16:21:14 <ZabaQ> d'oh
16:21:23 <gbyers> REBUILD-CCL :FULL T would try to write to the kernel; :CLEAN T avoids that.
16:22:03 <ZabaQ> so, yes, I tried it with the kernel in place with the old name. no problemo.
16:22:36 <gbyers> It's confusing.
16:31:00 <ZabaQ> indeed
16:36:54 milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has joined #ccl
16:47:14 rme_ [~rme@50.43.133.104] has joined #ccl
16:47:52 -!- rme [rme@C3BA3E18.5174A474.699BA7A6.IP] has quit [Ping timeout]
16:47:53 -!- rme_ is now known as rme
16:48:56 Fare [fare@nat/google/x-pswsyyfwwljfcgiy] has joined #ccl
16:49:24 -!- rme [~rme@50.43.147.141] has quit [Ping timeout: 248 seconds]
16:49:58 -!- Fare [fare@nat/google/x-pswsyyfwwljfcgiy] has quit [Client Quit]
16:50:04 rme [~rme@50.43.138.72] has joined #ccl
16:51:04 -!- rme [rme@BFABE2C8.B66D9153.699BA7A6.IP] has quit [Ping timeout]
16:51:05 -!- rme_ is now known as rme
16:51:23 -!- rme_ [~rme@50.43.133.104] has quit [Ping timeout: 245 seconds]
17:02:00 -!- fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has quit [Remote host closed the connection]
17:02:09 fe[nl]ix [~quassel@pdpc/supporter/professional/fenlix] has joined #ccl
17:08:06 maxm- [~user@unaffiliated/maxm] has joined #ccl
17:08:11 -!- maxm [~user@unaffiliated/maxm] has quit [Read error: Connection reset by peer]
17:12:21 -!- bfulgham_ [~brent@76.173.170.144] has quit [Quit: bfulgham_]
17:13:31 -!- maxm- is now known as maxm
17:20:51 -!- rme [rme@8BF4C302.BBBE3EE8.699BA7A6.IP] has quit [Ping timeout]
17:22:22 -!- rme [~rme@50.43.138.72] has quit [Ping timeout: 255 seconds]
17:26:41 rme_ [~rme@50.43.162.141] has joined #ccl
17:28:58 -!- milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has quit [Ping timeout: 251 seconds]
17:55:34 <ZabaQ> Thanks for the help,.
17:56:03 -!- ZabaQ [~jconnors@85.207.11.34] has quit [Quit: Leaving.]
18:09:18 milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has joined #ccl
18:15:04 -!- sellout [~Adium@c-24-91-75-190.hsd1.ct.comcast.net] has quit [Quit: Leaving.]
18:29:14 Fare [fare@nat/google/x-wcwnsingfzoyyhnx] has joined #ccl
19:31:51 <Fare> hi. When a ccl is caught in a deadlock, how do I get from the address to the lisp lock object and/or name of the variable that holds the lock object?
19:32:08 <Fare> (say I can see the address from strace and can attach a gdb)
19:45:37 <gz> If you can get a core dump, there is stuff in CCL that understands the structure of the lisp heap from a core dump, so you could search for a symbol whose value cell contains whatever...
19:45:58 <Fare> the process is not hosed, but one of its thread is busy polling or some such
19:46:16 <gz> You can get a core dump of a live process, no?
19:46:29 <Fare> without killing it?
19:46:47 <gz> I don't know, I always thought you could, but I guess I haven't actually ever done so...
19:47:58 <Fare> I suspect one process to be busy polling for a lock, for there's incessant lock activity
19:49:32 <gz> http://www.gsp.com/cgi-bin/man.cgi?section=1&topic=gcore
19:50:33 <rme_> Fare: You think you have the address of a lisp lock object in gdb?
19:50:36 -!- rme_ is now known as rme
19:50:47 <Fare> rme: yes
19:50:50 <Fare> strace gave it to me
19:51:30 <Fare> (it's in production, so no way the image is getting out of a restricted access area)
19:52:05 <rme> Then x/1g $addr - 13 will print out the uvector header for that object.  If it's a lock object, the low byte will be 118.
19:52:18 <rme> (aka x8664::subtag-lock)
19:52:31 <rme> Do you have a working repl?
19:53:43 <Fare> <Fare> grep -v 'futex\|nanosleep' strace.log | wc -l
19:53:43 <Fare> <Fare> 10359
19:53:44 <Fare> <Fare> wc -l < strace.log
19:53:44 <Fare> <Fare> 15806145
19:53:44 <Fare> <Fare> I say the problem is with a lock
19:54:09 <Fare> no repl, but can attach gdb
19:54:13 <Fare> can I go from gdb to repl?
19:54:30 <Fare> (not responding to requests, so can't use one to start a slime)
19:55:41 <Fare> at that point, I'm putting my money on a race condition in our application, possibly related to logging, but probably not since there's still some logging happening.
20:00:41 <Fare> 21120 futex(0x5b438f0, 0x189 /* FUTEX_??? */, 0, {1341343718, 175128}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
20:02:02 <Fare> x/10g 0x5b438e0
20:02:03 <Fare> 0x5b438e0:      2       49
20:02:03 <Fare> 0x5b438f0:      549755813888    1
20:02:03 <Fare> 0x5b43900:      8589934593      97293968
20:02:03 <Fare> 0x5b43910:      48      657
20:02:03 <Fare> 0x5b43920:      -4294967283     4294967295
20:28:47 <gbyers> that doesn
20:30:46 <gbyers> that doesn't look like a futex syscall; 189 doesn't look like a constant defined in /usr/include/linux/futex.h and doesn't look like a plausible LOGIOR of some constants.
20:33:15 <Fare> we established that it was the thread in a signal handler waiting for tcr->resume
20:35:27 <Fare> hex. It's FUTEX_CLOCK_REALTIME|FUTEX_PRIVATE_FLAG|FUTEX_TRYLOCK_PI|FUTEX_WAKE
20:36:05 <gbyers> Like I said,  not a plausible combination of flags.
20:39:19 <Fare> :-/
20:39:25 <Fare> where do we go from there?
20:46:44 <gbyers> Is that the only thread waiting to be resumed ?
20:48:44 <Fare> looks like so
20:49:01 <Fare> and it's eat a lot of cpu
20:50:08 <gbyers> if it's really waiting for tcr->resume, it shouldn't be busy-waiting.
20:51:03 <Fare> then another thread must be busy-waiting -- how do i tell/
20:52:29 <Fare> it might be that some corruption caused this futex to be called with the wrong arguments, thereby hosing it.
20:52:39 <Fare> then we'd have to find the source of the corruption
21:09:37 <gbyers> When the GC runs in some arbitrary thread, it sends signals to all other threads; they enter a signal handler and wait for their "resume" semaphores (tcr->resume) to be signaled, which should happen when the GC is finished.  For some unknown reason, this thread seems to still be waiting to be told to resume.
21:23:37 <Fare> is this semaphore usually used by anything but the gc?
21:26:25 bzzbzz [~franco@modemcable151.155-57-74.mc.videotron.ca] has joined #ccl
21:26:28 <gbyers> The futex syscall may be a red herring.  If the thread's waiting for tcr->resume, it should have called sem_wait_forever(), which in 1.8 (IIRC) does a repeated sem_timedwait() with a timeout of 1 second.  If so, you might be able to get the sem_timedwait() to return 0 (by setting $rax to 0 after the syscall) and that should cause the thread to wake up.
21:27:50 <gbyers> It's rarely used by anything else.   There's a way to suspend other threads from lisp (that uses the same mechanism) but it's dangerous and I don't think it's used in CCL anywhere.
21:35:06 <Fare> cat task/*/stat | cut -d' ' -f1,14,15,16,17,18 says that thread is taking most of the cpu.
21:35:10 <Fare> I lean towards dumping a core, saving it for analysis, and restarting a new process
21:39:13 <Fare> though I feel queasy about being unable to diagnose what went wrong.
21:39:36 <Fare> is there any diagnostic we can run while it's alive that we won't be able to from a core?
21:42:03 <Fare> what is the safe way to trigger a core dump?
21:42:07 <gbyers> one thing that may or may not be helpful is to see what other threads are doing lisp-wise, but what I'm thinking of would try to write to the standard error stream.  Can you see that output ?
21:43:22 <gbyers> I think that you can detach GDB and send the process some signal that's defined to dump core.
21:43:23 <Fare> it's a pipe, probably redirected to one or our log files
21:44:06 <rme> There's a gcore command that's supposed to generate a core for a running process.
21:46:28 <Fare> gbyers: can we do the lisp introspection somehow?
21:46:53 <Fare> after i dump core, it might be too late
21:49:11 <gbyers> what I was thinking of was selecting each other thread in GDB - by "(gdb) thread N" for n=1.. and in each thread doing "(gdb) call plbt_sp($rbp)".  That'll try to do a lisp backtrace for each thread.  That may or may not be interesting.
21:50:30 <Fare> ok
22:05:01 <Fare> looking at where the daemonizer logs the stderr, I see backtraces that mention the SIGQUIT-handler
22:05:50 <gbyers> I never remember what dumps core and what doesn't.
22:08:31 <Fare> oh, that was a different process
22:10:35 <Fare> in thread 1,   1 Thread 0x7f2cccb29880 (LWP 17403)  0x00007f2ccc28786d in nanosleep ()
22:10:35 <Fare>     at ../sysdeps/unix/syscall-template.S:82
22:10:49 <Fare> (gdb) call plbt_sp($rbp)
22:10:49 <Fare> [New Thread 0x7f2cb7fff700 (LWP 12267)]
22:10:50 <Fare> [Thread 0x7f2cbe86a700 (LWP 32343) exited]
22:11:30 <Fare> (gdb) thread 2
22:11:30 <Fare> [Switching to thread 2 (Thread 0x7f2cbf91c700 (LWP 21164))]#0  0x00007f2cc97b7994 in ttcpip ()
22:11:30 <Fare>    from /usr/lib/libclntsh.so.11.1
22:11:30 <Fare> (gdb) call plbt_sp($rbp)
22:11:30 <Fare> [Thread 0x7f2cb730f700 (LWP 32344) exited]
22:11:31 <Fare> [Thread 0x7f2cc02e0700 (LWP 17867) exited]
22:11:33 <Fare> [Thread 0x7f2cbef58700 (LWP 17877) exited]
22:11:35 <Fare> [Thread 0x7f2cbdea6700 (LWP 21117) exited]
22:11:37 <Fare> [Thread 0x7f2cbd916700 (LWP 21120) exited]
22:11:39 <Fare> [Thread 0x7f2cbd386700 (LWP 21121) exited]
22:11:41 <Fare> [Thread 0x7f2cbf91c700 (LWP 21164) exited]
22:11:43 <Fare> ^C
22:11:45 <Fare> oops
22:11:56 <Fare> I'm stuck :-(
22:13:24 -!- milanj [~milanj_@93-86-188-208.dynamic.isp.telekom.rs] has quit [Quit: Leaving]
22:13:29 <gbyers> If you just did ^C, you should be back in GDB.   You may see some lisp backtraces in whatever log file.
22:13:53 <Fare> <Fare> oh well, I killed it wrong
22:13:53 <Fare> <Fare> I got gdb stuck trying to inspect the wrong thread (from oracle)
22:14:06 <Fare> <Fare> then trying to kill gdb itself, i killed the lisp process :-(
22:14:56 <Fare> sigh.
22:15:04 <Fare> well, thanks a lot for your help
22:15:05 <gbyers> Oh well.  Has this happened before to your knowledge ?
22:15:21 <Fare> not to my knowledge -- not this kind of getting hosed.
22:15:35 <gbyers> Good news and bad.
22:15:51 <Fare> sorry I bungled it
22:16:33 <gbyers> I guess "if it happens again, let us know".   I don't have a good theory.
22:51:29 -!- Fare [fare@nat/google/x-wcwnsingfzoyyhnx] has quit [Ping timeout: 265 seconds]
22:58:24 <fe[nl]ix> the CPU-hogging futex loop was caused by the leap second bug in the linux kernel
23:51:22 -!- alms_ [~alms_@209-6-130-32.c3-0.bkl-ubr1.sbo-bkl.ma.cable.rcn.com] has quit [Quit: alms_]