wj200-vfd segfaulting after updates

09 Mar 2015 04:10 - 09 Mar 2015 04:11 #56552 by green751
Hello, all;


I've been trying to get my spindle speed control working using the scale object as described in the manual, but without luck. I keep getting a segfault from wj200-vfd just after startup of linuxcnc. For the record, I'm using a fresh install from the Debian DVD, and I've just run "apt-get update" to update the system to (I think) 2.6.7.

Today I discovered that the spindle no longer works using my "old" config either, so my problems *may* not be due to my own meddling. In the old known working config (which had manual control of the spindle) the setup for the wj200-vfd module is limited to:

# Hitachi wj200 5.5kw vfd running 5 hp mill spindle
loadusr -W wj200_vfd

#default slave address is 1
setp wj200-vfd.0.mbslaveaddr 1
setp wj200-vfd.0.commanded-frequency 60

# connect to wj200-vfd pins
net spindle-on wj200-vfd.0.run
net spindle-cw wj200-vfd.0.reverse
net spindle-at-speed wj200-vfd.0.is-at-speed

---

From the new config, which doesn't work either (same segfault):
# Set up calculation for wj200 frequency command - 0 to 60 hz
loadrt scale count=1
addf scale.0 servo-thread
setp scale.0.gain 0.015
setp scale.0.offset 0
net spindle-vel-cmd-rps-abs scale.0.in
net spindle-calculated-freq scale.0.out wj200-vfd.0.commanded-frequency
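
For reference, HAL's scale component computes out = in * gain + offset, so the gain decides how many hertz each input unit is worth. A quick arithmetic check in Python follows. The 4000 rpm top speed is a made-up figure for illustration only, not something stated in this thread:

```python
def scale(value, gain, offset=0.0):
    """Same arithmetic as HAL's scale component: out = in * gain + offset."""
    return value * gain + offset

GAIN = 0.015        # from the config above
MAX_FREQ_HZ = 60.0  # "0 to 60 hz" per the config comment

# Input value that commands the full 60 Hz with this gain.
full_scale_input = MAX_FREQ_HZ / GAIN

# Hypothetical spindle that turns 4000 rpm at 60 Hz (assumed, not from the thread).
ASSUMED_MAX_RPM = 4000.0
gain_for_rpm = MAX_FREQ_HZ / ASSUMED_MAX_RPM           # input in rev/min
gain_for_rps = MAX_FREQ_HZ / (ASSUMED_MAX_RPM / 60.0)  # input in rev/s

print(full_scale_input, gain_for_rpm, gain_for_rps)
```

With a gain of 0.015, an input of about 4000 is needed to reach 60 Hz. That looks more like an rpm figure than revolutions per second, so if the input signal really is in rps (as its name suggests), the gain may be worth a second look. This is separate from the segfault.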


Anyone else having issues after updating?

Erik
Last edit: 09 Mar 2015 04:11 by green751. Reason: clarification

09 Mar 2015 19:56 #56563 by ArcEye
Hi

green751 wrote: "I keep getting a segfault from wj200-vfd just after startup of linuxcnc."


I don't use this component, but the first thing to check after getting a segfault is dmesg.
The last couple of lines should show the memory address it occurred at and the library that caused the exception.

You will need to boot, start linuxcnc, get the error and then run dmesg.

Is there anything in the other error messages to suggest how far it is getting?
i.e. does the segfault occur when loading the component, or when the params are set, etc.?

You could try setting DEBUG=0x7777 or similar in the ini file and running from a terminal to see if any useful info is emitted.

regards

09 Mar 2015 21:49 #56566 by green751
I'm seeing this in dmesg... without running gdb on the wj200_vfd program I don't think I can determine exactly what the issue is, although I've been considering editing the source to look for anywhere it might fault and having it emit an error message instead:

hm2/hm2_5i25.0: IO Pin 033 (P2-13): IOPort
[63960.900984] hm2/hm2_5i25.0: registered
[63960.900989] hm2_5i25.0: initialized AnyIO board at 0000:01:06.0
[63962.080471] wj200_vfd[4331]: segfault at 69703d73 ip 08048d72 sp bf9b4800 error 6 in wj200_vfd[8048000+2000]
[64038.640115] hm2_5i25.0: dropping AnyIO board at 0000:01:06.0
[64038.640128] hm2/hm2_5i25.0: unregistered
[64038.640222] hm2_pci: driver unloaded
[64038.643489] hm2: unloading
[64040.797309] RTAI[math]: unloaded.
[64040.800532] SCHED releases registered named ALIEN PEDV$D
[64040.804290] RTAI[malloc]: unloaded.
[64040.901080] RTAI[sched]: unloaded (forced hard/soft/hard transitions: traps 0, syscalls 0).
[64040.904331] I-pipe: head domain RTAI unregistered.
[64040.904431] RTAI[hal]: unmounted.
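
The segfault line packs several fields into one string. As a rough aid, here is a small Python sketch that pulls them apart. The regular expression and the helper name parse_segfault are just for illustration and only cover the line format shown above:

```python
import re

LINE = ("[63962.080471] wj200_vfd[4331]: segfault at 69703d73 ip 08048d72 "
        "sp bf9b4800 error 6 in wj200_vfd[8048000+2000]")

# Matches the kernel's "segfault at ... ip ... sp ... error ... in obj[base+size]" line.
PATTERN = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    """Return the interesting fields of a dmesg segfault line as a dict."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    f = m.groupdict()
    ip = int(f["ip"], 16)
    base = int(f["base"], 16)
    return {
        "program": f["prog"],
        "pid": int(f["pid"]),
        "fault_addr": int(f["addr"], 16),  # address the program tried to access
        "ip": ip,                          # instruction that faulted
        "error": int(f["err"]),            # page-fault error code
        "object": f["obj"],                # mapped file containing the instruction
        "ip_offset": ip - base,            # offset of the instruction in that mapping
    }

info = parse_segfault(LINE)
print(hex(info["ip_offset"]))                     # offset inside wj200_vfd
print(info["fault_addr"].to_bytes(4, "little"))   # fault address as raw bytes
```

The faulting instruction sits at offset 0xd72 inside wj200_vfd itself, not in a shared library. The fault address bytes also happen to be printable ASCII ("s=pi" in memory order). That may be a coincidence, but it can also hint that a pointer got overwritten with text, so it seems worth keeping in mind.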

09 Mar 2015 21:55 #56568 by green751
Here's the debug output, I'm not seeing anything particularly enlightening in it, but maybe you will:


LINUXCNC - 2.6.7
Machine configuration directory is '/home/erik/linuxcnc/configs/lagun250'
Machine configuration file is 'lagun250.ini'
Starting LinuxCNC...
(time=1425912639.741964,pid=4671): Registering server on TCP port 5005.
(time=1425912639.742174,pid=4671): running server for TCP port 5005 (connection_socket = 3).
.
emcTaskPlanInit() returned 0
emcTaskPlanSynch() returned 0
emcTaskPlanInit() returned 0
Issuing EMC_TASK_PLAN_SYNCH -- (+516,+12, +0,)
emcTaskPlanSynch() returned 0
Issuing EMC_TRAJ_SET_TERM_COND -- (+222,+24, +0, +2,0.000000,)
Issuing EMC_TRAJ_SET_G5X -- (+224,+88, +0, +1,-0.657834,-3.468684,-2.434188,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,)
Issuing EMC_TRAJ_SET_G92 -- (+227,+84, +0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,)
Issuing EMC_TRAJ_SET_ROTATION -- (+226,+20, +0,0.000000,)
Issuing EMC_TASK_PLAN_SET_BLOCK_DELETE -- (+518,+16, +1,\001,)
Issuing EMC_TASK_PLAN_SET_OPTIONAL_STOP -- (+517,+16, +2,\001,)
Issuing EMC_TASK_SET_MODE -- (+504,+16, +3, +3,)
emcTaskPlanSynch() returned 0
Issuing EMC_TASK_PLAN_SYNCH -- (+516,+12, +0,)
emcTaskPlanSynch() returned 0
Issuing EMC_TASK_SET_MODE -- (+504,+16, +4, +2,)
emcTaskPlanSynch() returned 0
Issuing EMC_TASK_PLAN_SYNCH -- (+516,+12, +0,)
emcTaskPlanSynch() returned 0
Issuing EMC_TASK_PLAN_OPEN -- (+506,+268, +5,/usr/share/axis/images/axis.ngc,)
emcTaskPlanOpen(/usr/share/axis/images/axis.ngc) returned 0
Issuing EMC_TRAJ_SET_MAX_VELOCITY -- (+207,+20, +6,3.000000,)
Issuing EMC_TRAJ_SET_SPINDLE_SCALE -- (+233,+20, +7,0.000000,)
Issuing EMC_TRAJ_SET_RAPID_SCALE -- (+238,+20, +8,0.000000,)
Issuing EMC_TRAJ_SET_SCALE -- (+209,+20, +9,0.000000,)
Issuing EMC_TRAJ_SET_SCALE -- (+209,+20, +10,1.000000,)
Issuing EMC_TRAJ_SET_RAPID_SCALE -- (+238,+20, +11,1.000000,)
Issuing EMC_TRAJ_SET_SPINDLE_SCALE -- (+233,+20, +12,1.000000,)
Issuing EMC_TRAJ_SET_SCALE -- (+209,+20, +13,1.000000,)
Issuing EMC_TRAJ_SET_RAPID_SCALE -- (+238,+20, +14,1.000000,)
Issuing EMC_TRAJ_SET_SPINDLE_SCALE -- (+233,+20, +15,1.000000,)
Shutting down and cleaning up LinuxCNC...
Running HAL shutdown script
(time=1425912694.611647,pid=4671): Deleting 5 channels from the NML_Main_Channel_List.
(time=1425912694.611787,pid=4671): Deleting emcCommand NML channel from NML_Main_Channel_List.
(time=1425912694.611838,pid=4671): deleting NML (1)
(time=1425912694.611874,pid=4671): delete (CMS *) 0x9ca11e8;
(time=1425912694.611920,pid=4671): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0xb771e000)
(time=1425912694.612041,pid=4671): deleting CMS (emcCommand)
(time=1425912694.612085,pid=4671): free( data = 0x9ca1bd0);
(time=1425912694.612125,pid=4671): Leaving ~CMS()
(time=1425912694.612160,pid=4671): CMS::delete(0x9ca11e8)
(time=1425912694.612194,pid=4671): CMS::delete successful.
(time=1425912694.612309,pid=4671): Leaving ~NML()
(time=1425912694.612346,pid=4671): NML channel deleted from NML_Main_Channel_List
(time=1425912694.612385,pid=4671): Deleting emcStatus NML channel from NML_Main_Channel_List.
(time=1425912694.612430,pid=4671): deleting NML (2)
(time=1425912694.612511,pid=4671): delete (CMS *) 0x9ca6ab8;
(time=1425912694.612548,pid=4671): rcs_shm_close(shm->key=1002(0x3EA),shm->size=16384(0x4000),shm->addr=0xb771a000)
(time=1425912694.612597,pid=4671): deleting CMS (emcStatus)
(time=1425912694.612635,pid=4671): free( data = 0x9ca74a0);
(time=1425912694.612669,pid=4671): Leaving ~CMS()
(time=1425912694.612747,pid=4671): CMS::delete(0x9ca6ab8)
(time=1425912694.612782,pid=4671): CMS::delete successful.
(time=1425912694.612816,pid=4671): Leaving ~NML()
(time=1425912694.612849,pid=4671): NML channel deleted from NML_Main_Channel_List
(time=1425912694.612883,pid=4671): Deleting emcError NML channel from NML_Main_Channel_List.
(time=1425912694.612916,pid=4671): deleting NML (3)
(time=1425912694.613011,pid=4671): delete (CMS *) 0x9cab8d0;
(time=1425912694.613094,pid=4671): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0xb7718000)
(time=1425912694.613144,pid=4671): deleting CMS (emcError)
(time=1425912694.613182,pid=4671): free( data = 0x9cac2b8);
(time=1425912694.613262,pid=4671): Leaving ~CMS()
(time=1425912694.613295,pid=4671): CMS::delete(0x9cab8d0)
(time=1425912694.613328,pid=4671): CMS::delete successful.
(time=1425912694.613362,pid=4671): Leaving ~NML()
(time=1425912694.613395,pid=4671): NML channel deleted from NML_Main_Channel_List
(time=1425912694.613428,pid=4671): Deleting toolCmd NML channel from NML_Main_Channel_List.
(time=1425912694.613506,pid=4671): deleting NML (4)
(time=1425912694.613540,pid=4671): delete (CMS *) 0x9cae638;
(time=1425912694.613575,pid=4671): rcs_shm_close(shm->key=1004(0x3EC),shm->size=1024(0x400),shm->addr=0xb7717000)
(time=1425912694.613622,pid=4671): deleting CMS (toolCmd)
(time=1425912694.613659,pid=4671): free( data = 0x9caf020);
(time=1425912694.613693,pid=4671): Leaving ~CMS()
(time=1425912694.613770,pid=4671): CMS::delete(0x9cae638)
(time=1425912694.613803,pid=4671): CMS::delete successful.
(time=1425912694.613837,pid=4671): Leaving ~NML()
(time=1425912694.613904,pid=4671): NML channel deleted from NML_Main_Channel_List
(time=1425912694.613939,pid=4671): Deleting toolSts NML channel from NML_Main_Channel_List.
(time=1425912694.614050,pid=4671): deleting NML (5)
(time=1425912694.614084,pid=4671): delete (CMS *) 0x9caf7d0;
(time=1425912694.614119,pid=4671): rcs_shm_close(shm->key=1005(0x3ED),shm->size=8192(0x2000),shm->addr=0xb7715000)
(time=1425912694.614168,pid=4671): deleting CMS (toolSts)
(time=1425912694.614257,pid=4671): free( data = 0x9cb01b8);
(time=1425912694.614294,pid=4671): Leaving ~CMS()
(time=1425912694.614327,pid=4671): CMS::delete(0x9caf7d0)
(time=1425912694.614361,pid=4671): CMS::delete successful.
(time=1425912694.614395,pid=4671): Leaving ~NML()
(time=1425912694.614429,pid=4671): NML channel deleted from NML_Main_Channel_List
(time=1425912694.614514,pid=4671): deleting NML (1)
(time=1425912694.614585,pid=4671): Leaving ~NML()
(time=1425912694.614620,pid=4671): NML::operater delete(0x9ca1008)
(time=1425912694.614677,pid=4671): NML channel deleted from Dynamically_Allocated_NML_Objects
(time=1425912694.615391,pid=4671): deleting NML (2)
(time=1425912694.615444,pid=4671): Leaving ~NML()
(time=1425912694.615526,pid=4671): NML::operater delete(0x9ca6938)
(time=1425912694.615564,pid=4671): NML channel deleted from Dynamically_Allocated_NML_Objects
(time=1425912694.615600,pid=4671): deleting NML (3)
(time=1425912694.615634,pid=4671): Leaving ~NML()
(time=1425912694.615667,pid=4671): NML::operater delete(0x9cab6d0)
(time=1425912694.615744,pid=4671): NML channel deleted from Dynamically_Allocated_NML_Objects
(time=1425912694.615780,pid=4671): deleting NML (4)
(time=1425912694.615813,pid=4671): Leaving ~NML()
(time=1425912694.615845,pid=4671): NML::operater delete(0x9cae4b8)
(time=1425912694.615879,pid=4671): NML channel deleted from Dynamically_Allocated_NML_Objects
(time=1425912694.615912,pid=4671): deleting NML (5)
(time=1425912694.615988,pid=4671): Leaving ~NML()
(time=1425912694.616024,pid=4671): NML::operater delete(0x9caf650)
(time=1425912694.616059,pid=4671): NML channel deleted from Dynamically_Allocated_NML_Objects
erik@lagun250:~/linuxcnc/configs/lagun250$

09 Mar 2015 22:19 #56569 by ArcEye

green751 wrote: "[63962.080471] wj200_vfd[4331]: segfault at 69703d73 ip 08048d72 sp bf9b4800 error 6 in wj200_vfd[8048000+2000]"


Well, it shows the source of the segfault as the wj200_vfd component, so at least it is narrowed down to that and not an interaction with something else.

green751 wrote: "Here's the debug output, I'm not seeing anything particularly enlightening in it, but maybe you will:"


Afraid not

By far the most common causes of segfaults are accessing NULL pointers and indexing beyond the end of arrays.

I would try just loading it without referencing it at all after that.

If it loads, add back in the missing lines until it segfaults again, that might give some clue to what is going on when you look at the source.

I can't even try loading it, because I don't have a libmodbus connection etc

regards

10 Mar 2015 00:00 #56585 by green751
It looks like it's faulting when I try to set the mbslaveaddr (target address) to 1. The segfault error 6 means the program tried to write to a nonexistent page address, so I'm guessing that either the program failed to connect to the VFD and didn't catch that fact, or else libmodbus itself is causing the problem.
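
For anyone decoding these by hand: on x86 the "error N" number in the segfault line is the page-fault error code, a small bit field. A minimal Python sketch of the three low bits (the table layout and function name are only illustrative):

```python
# Low bits of the x86 page-fault error code, as (bit, meaning if set, meaning if clear).
FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
]

def decode_page_fault(error_code):
    """Describe each of the three low bits of a page-fault error code."""
    return [if_set if error_code & bit else if_clear
            for bit, if_set, if_clear in FLAGS]

print(decode_page_fault(6))
# ['page not present', 'write access', 'user mode']
```

So error 6 is a user-mode write to an address with no page mapped. By comparison, a read through a NULL pointer would typically show up as error 4.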

I'll see if I can find a libmodbus or other test program and verify the serial port is working and that I can talk to the VFD, then if I don't find anything there I'll check into enabling debugging somehow in the wj200_vfd program.

The program does try to print errors at some points, but I'm still not sure what sort of error logging for user space modules is available in the current incarnation of linuxcnc...

10 Mar 2015 00:08 - 10 Mar 2015 00:08 #56587 by ArcEye

green751 wrote: "the segfault error 6 means the program tried to write to a nonexistent page address"


Yep, that was exactly the type of error I was talking about.

green751 wrote: "I'm still not sure what sort of error logging for user space modules is available in the current incarnation of linuxcnc..."


That is not a problem at all with a userspace component, it is just a normal ELF binary.

Run linuxcnc from a terminal and add statements along the lines of fprintf(stderr, "This is the value of x %d\n", x); and you will be able to see the output.

regards
Last edit: 10 Mar 2015 00:08 by ArcEye.

10 Mar 2015 02:24 #56596 by green751
Okay, good.

The existing module does this for errors, but apparently it's not catching the one that's causing the segfault (by definition, I guess).

Erik

10 Mar 2015 03:22 - 12 Mar 2015 03:13 #56600 by green751
I think I found the issue.

(deleted my suspicions about module naming and errors)

edit II: Communications are working ok, see below.

I did some testing... I don't think it's a problem with linuxcnc at this point. Certainly loadusr and the wj200 module could be more helpful, but from what I can tell my communication link to the VFD is faulty somehow... it gets CRC errors now when I start linuxcnc, and I also tried the free program "modpoll" and got the same results.
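
As background on what a CRC error means here: every Modbus RTU frame ends with a CRC-16, sent low byte first, and the receiver discards the frame if its own calculation disagrees. A minimal Python sketch of that check follows. The request bytes are a generic "read one holding register from slave 1" example, not a specific WJ200 register, and the helper names are made up for illustration:

```python
def modbus_crc16(frame: bytes) -> int:
    """CRC-16 as used by Modbus RTU (initial value 0xFFFF, reflected polynomial 0xA001)."""
    crc = 0xFFFF
    for byte in frame:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def append_crc(body: bytes) -> bytes:
    """Append the CRC, low byte first, as it goes out on the wire."""
    return body + modbus_crc16(body).to_bytes(2, "little")

def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the body and compare with the trailing two bytes."""
    if len(frame) < 4:
        return False
    body, tail = frame[:-2], frame[-2:]
    return modbus_crc16(body) == int.from_bytes(tail, "little")

# Slave 1, function 0x03, start address 0, one register (generic example).
request = append_crc(bytes.fromhex("010300000001"))
print(request.hex())                     # 010300000001840a

# A single flipped bit, e.g. from line noise, makes the check fail.
damaged = bytes([request[0] ^ 0x10]) + request[1:]
print(crc_ok(request), crc_ok(damaged))  # True False
```

CRC failures on otherwise sensible-looking frames usually point at the physical link (wiring, termination, baud rate or parity settings) and not at the software building the frames.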

I loaded up Hitachi's inverter management program and checked the settings, and was able to run the inverter ok.

My theory at this point is that communication errors are producing invalid data and the wj200 module is timing out/taking a break, during which time HAL attempts to write to the module's memory causing the segfault.

I'm going to see if I can dig up a different serial->rs485 converter to use, and try wiring the connection differently with as much of it shielded as possible.

Erik
Last edit: 12 Mar 2015 03:13 by green751.

11 Mar 2015 22:21 #56685 by green751
Okay, I spent some time working on this last night.

I used a separate communications cable and converter to test the rs485 link to the VFD. I verified that it is in fact working normally, although after a few hundred queries the wj200 seems to stop responding for a while. This is probably a mechanism built into the communications port to avoid causing problems if the rs485 network isn't behaving normally.

At this point I've got my regular rs485 converter hooked back up with a shielded cable to the wj200. The free "modpoll" utility works and is able to retrieve data from the wj200, which I verified as valid (rather than random numbers) so I know the hardware works. I decided to check for problems with versions of linuxcnc.

I uninstalled 2.6.7 and tried 2.6.1, same result (segfault in the wj200_vfd program).

I also uninstalled 2.6.1 and tried 2.6.0, same result.

I loaded up my old (known working) configuration files from two months ago, same result.

I then copied over the linuxcnc-dev tree from two months ago (again, known working on my old OS install) and got the same result. I had to recompile it first; because of RTAI it wouldn't run until I did.

I then did a git pull of the latest linuxcnc-dev tree, verified the same result, and started inserting debugging statements into the code, compiling, and testing. I have determined that the segfault comes in the following statements:

is_running = status.running;
is_at_speed = status.at_speed;
is_ready = status.ready;
is_alarm = status.alarm;

Basically this is where the main loop inside wj200_vfd writes status information to the pins exposed to HAL. So the segfault isn't happening in libmodbus or in the wj200_vfd code itself; it has something to do with the pin data structures/#defines generated by comp (now "halcompile"?). I've been working on figuring out how and why halcompile generates the pin data structures, to see what's not getting allocated or what is too small for the data type being put into it. It's odd that the same thing happened with the old (pre-2.7.0) source that I tried from a couple of months back, since that code worked fine. There must be something different in either the system libraries or the C compiler's handling of the code generated by halcompile. I did look, and none of the other VFD modules use comp; they're all written in C directly.

I may try to verify this by loading my old OS environment, which was based on the old linuxcnc 2.5 ubuntu install disc. I did git pulls of the new source back then to update and try new versions. The current environment that doesn't work is the Debian Wheezy disc that's available for download now, which I've run apt-get update and apt-get upgrade (but not a system upgrade) on, and it's running linuxcnc-2.6.7 and libmodbus 3.0.3.

A couple of other notes so far:

1) wj200_vfd is written in such a way that if a given instance can't communicate with the vfd, it will wait forever (or until unloaded) until it does so, and will not execute its main loop to update data or send commands to the vfd. This makes sense functionally, however it also means that whatever bug is happening here is not triggered unless communication with the VFD can be established. So if this module got tested by loading it up without the VFD, it would appear to work, report ready, and create pins but not actually do anything and not create a segfault or show any other problem. This might be an issue for anyone trying to test it.

2) I did try running halrun and manually loading the wj200_vfd module, but I didn't get a segfault. I suspect I wasn't actually "running" the module because I'm not that familiar with how that utility works, and that I might need to load a real time thread and start it up to do a full test.

3) I'd like to try running gdb against the wj200_vfd program while linuxcnc is running - does anyone know if I can just compile wj200_vfd with debug information and attach to the process once it's running? Timing would be difficult, of course. Or can I have linuxcnc run loadusr on gdb with wj200_vfd as a parameter to gdb?

Erik
