Axis 2.9.0~pre: jogging moves the tool cone, but not the steppers

15 Sep 2022 19:51 #252037 by Dr. John
I'm still trying to wade through the maze of complexity, but I was told by the moderator that everything funnels through halrun (halcmd). As near as I can tell, this seems to be true. It's why I built a trace into halcmd, to see what it's receiving from axis. Outside of the configuration generated by pncconf, it seems like there is nothing. But if there is a side channel for communication of motion commands, then I'm not getting full visibility into what's going on.

Nonetheless, it's clear that I can see what I think are motion commands being sent to the I/O card from pncconf and that these are absent with axis. It's one of the various reasons why I think one has to look in the direction of axis to find where the problem lies.

Thanks for your help.

15 Sep 2022 20:06 #252040 by Dr. John
I think you're referring to the emcSendCommand function. It's the one that seems like it might send data to the lower levels of the system. I note that a search throughout the entire set of files only shows that command in three locations:

1) within itself;
2) within emcmodule.o; and
3) ../python/

If this indeed is the means for sending commands to the I/O card, it seems like nobody is using it, at least on my system.


15 Sep 2022 20:36 #252044 by arvidb
The code is incredibly obfuscated. It'd probably win all sorts of obfuscation contests if it was entered. :)

For an Axis jog, I think the call stack looks something like this:

def jog(*args):
... which calls: c.jog()    (where c is set by c = linuxcnc.command())
... which somehow ends up at: static PyObject *jog(pyCommandChannel *s, PyObject *o) {
... which in turn calls: emcSendCommand(s, ...);

I'd be surprised if your problem is due to a bug in this part of the code though. But hey, you never know!

15 Sep 2022 20:50 #252046 by Dr. John
I agree about the obfuscation part. It's worse if you don't program in python, tcl, c++, and avoid using C as a matter of course. ;-)

Of course, one of the issues is that every decision point comprises a failure mode and, near as I can tell, nobody thought about trying to reduce the number of these. Rather, the effort was towards getting to something that would work. Over the years it's worked well enough that it was "good enough", something that I understand. But it doesn't mean that failure modes don't exist, nor that they won't manifest themselves at some point: according to Murphy, at the least convenient time. The main good thing about the current situation is that it's highly repeatable. That suggests that only a single decision point isn't working. The problem, of course, is to find which one!

I still tend to think that the problem is related to configuration/environment, which causes the code to fail. It's the main difference from all of the other successful installations that currently exist. And it wouldn't be that unusual for a person, namely me, to have something that is different enough from normal to cause this result. Still, the search is made immensely more complicated by the complexity of the code.

Oh well. Thanks again for your help.

15 Sep 2022 21:33 - 15 Sep 2022 21:34 #252048 by PCW
Since your hal and ini files (which contain _all_ of LinuxCNC's configuration information)
have been tested and work with LinuxCNC 2.9 on a 7I96S
this pretty much rules out configuration or LinuxCNC.

Again, this suggests a hardware or perhaps driver issue,
in any case something specific to your system.

I would try a different host, and if that doesn't fix the issue,
return the 7I96S to Mesa for evaluation.
Last edit: 15 Sep 2022 21:34 by PCW.

15 Sep 2022 21:43 #252050 by Dr. John
I'm sorry. You and I keep going around in circles. The problem is that pncconf works just fine. That eliminates the issue of drivers or something specific to my system in terms of hardware and/or OS. I can run the test on pncconf all day (which I won't because of the unnecessary wear and tear) without difficulty. It's only when I run axis that I have the issue. That clearly points to the code associated with axis as the place where the problem exists.

Of course, I'm repeating myself. I hope that I can have a productive dialogue with arvidb. At least he is open to accepting and responding to the evidence. I hope to not have any further dialogue with you. It's been totally unproductive.

15 Sep 2022 21:56 #252051 by arvidb
The fact that no one else seems to have this problem "eliminates" code associated with Axis as well, though. Clearly, it's an unusual problem and it's too early to eliminate anything, really.

You'd do well to listen to PCW; he's the owner of Mesa and knows what he's talking about better than most - he's probably the one who designed the network protocol as well as wrote the firmware and (much of?) the drivers. You really should thoroughly eliminate any possible source of your problem that he identifies before looking at other things. Or you just risk wasting everyone's time.

15 Sep 2022 22:38 #252052 by Dr. John
Thank you for your feedback. I very much appreciate it. Of course, I had no idea with whom I was communicating.

Still, there seems to be a stubbornness related to accepting that things work in the pncconf code and not the axis code. Given that things work with the pncconf code, and pncconf accesses the card via halcmd, that suggests that everything south of halcmd is functional. Unless axis uses a side channel for communicating with the I/O card that bypasses halcmd, that is pretty clear evidence that the problem is north of halcmd. That means axis and everything that is uniquely associated with axis.

So, while I respect the position of PCW and therefore his authority, I do have a problem with his diagnostic/logic skills. I've already posted logs that demonstrate that axis isn't delivering the jog commands to halcmd whereas they are being delivered to halcmd from pncconf. I'd say that this is smoking gun evidence of where to look for the problem.

It is quite clear that axis having run on tens of thousands of installations is no guarantee that it will run successfully on all installations. It simply shows that it has a high probability of running on an installation. It doesn't preclude a failure.

I doubt that the cause has anything to do with the source code, though I might be wrong. Having run on thousands of installations does give one confidence in the source code. So, the most likely problem lies with the system configuration in one way or another. If, for example, my version of tcl is later than the ones that have been previously tested and it causes a silent failure that nobody else has run into until now, that could explain things. We know that software engineering isn't perfect, given the huge number of updates that I install on a regular basis, so that isn't outside the range of possibilities. There are other possibilities as well.

With all due respect for PCW, and with apologies for possibly hurting his feelings, I need to work with someone who will follow the evidence. So far, he hasn't.

23 Sep 2022 18:59 #252631 by Dr. John
FWIW, here’s some more info on this problem. Recently, I’ve been tossing out old computers due to a pending move, so I’ve really got no spares. But, I realized that my old 32 bit LinuxCNC computer was still a viable candidate for installing the latest version of LinuxCNC, which I did. The results were unexpected: exactly the same. pncconf works just fine; axis doesn’t at all. These results don’t help isolate the problem much, but they tend to rule out the probability of a problem with the computer hardware or the OS (the old one is Lubuntu 18.04).

What more then could I do to triangulate the problem? Well, I had additional data from tcpdump, so I decided to analyze it. Each analysis included about 100,000 packets. Here are my results.

1) On average, there are two packets transmitted by the computer for each one transmitted by the card. The standard deviation of the times between packets is about 35us. The maximum period between a received packet and the next transmitted one in successive packets (i.e., the two that come before the response packet) is typically 880us, and exceeds 950us once (954us). There are zero dropped packets.

2) I reverse engineered the communication protocol within the udp packets in accordance with the documentation for the 7i96s card. All packets being transmitted are within the bounds of the documented protocol (although I can’t dig deeper than this because it requires a deeper understanding of how the card works than I’m willing to try to get). All of the response packets return the correct number of bytes in accordance with the read commands.

3) This is true for both axis and pncconf. The differences between the two sets of results are essentially zero.
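
For anyone who wants to repeat the timing analysis in item 1, here is a minimal sketch of how such figures might be computed. It is not the script used for the results above. It assumes the tcpdump capture has already been reduced to a list of (timestamp in seconds, direction) pairs, where "tx" means computer to card and "rx" means card to computer; the function name and the input layout are hypothetical.

```python
import statistics


def summarize_capture(packets):
    """Summarize a capture given as [(timestamp_seconds, direction), ...].

    direction is "tx" (computer -> card) or "rx" (card -> computer).
    This input layout is an assumption for illustration only.
    """
    tx = sum(1 for _, d in packets if d == "tx")
    rx = sum(1 for _, d in packets if d == "rx")

    # Time between every pair of consecutive packets, in microseconds.
    pairs = list(zip(packets, packets[1:]))
    gaps_us = [(b[0] - a[0]) * 1e6 for a, b in pairs]

    # Time from a received packet to the next transmitted packet.
    rx_to_tx_us = [
        (b[0] - a[0]) * 1e6
        for a, b in pairs
        if a[1] == "rx" and b[1] == "tx"
    ]

    return {
        "tx_per_rx": tx / rx if rx else float("inf"),
        "gap_stdev_us": statistics.pstdev(gaps_us) if gaps_us else 0.0,
        "max_rx_to_tx_us": max(rx_to_tx_us, default=0.0),
    }


# Made-up timestamps, only to show the call; not data from this thread.
sample = [
    (0.0000, "tx"), (0.0001, "tx"), (0.0002, "rx"),
    (0.0010, "tx"), (0.0011, "tx"), (0.0012, "rx"),
]
print(summarize_capture(sample))
```

Running the same function over an axis capture and a pncconf capture and comparing the two dictionaries is one way to check the claim in item 3.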

What this tells me is that the ethernet hardware, the handlers within the OS, and the parts of LinuxCNC that are involved with setting up packets, transmitting and receiving them, and scheduling their transmissions are working correctly. It appears that the firmware in the card is working correctly too, at least as it relates to the reception, interpretation and response to the packets it receives. Though I don’t know for sure, I assume that the handlers for these operations are identical for pncconf and axis.

From arvidb’s documentation of the communications call stack and from my further investigation into the hm2eth software, it seems improbable that something is mangled along the way. The message handling seems straightforward, although how it gets from level to level is not always obvious (to me).

This suggests some sort of incompatibility between axis and the 7i96s card that I have. Given that pncconf moves the axes on my mill correctly with the 7i96s, I have doubts about whether the problem is with the card, but can’t entirely eliminate it. Could it be, for example, that a new “feature” needed by axis was added to the card at the last moment, but the firmware version that I have precedes it? There is no way that I can assuredly make that diagnosis, nor eliminate the possibility. Otherwise, axis and the software upon which it uniquely depends remain the most likely cause of the problem.

Because of the move I’ve been in a race against time. At this point it looks like I’ll assuredly lose. If I return the card, then, if things go well for the move (and if the card arrives on time), it will be packed in a box apart from my mill, which will ship only partially assembled, and I’ll have to discover its location six months or more down the road when I’m ready to try to bring it up. Since my move is overseas, I’ll have to explain to the customs agent why I need yet another computer, this one 15 years old, because it’s the only one that has a PCI bus, which I need to keep the mill functional until I can reintegrate it with the 7i96s. Making it work with my 5i25 PCI card will be harder than it seems because I’ve had to repackage the mill in order to make it ready to move, and I designed the package around the 7i96s. Hence, a rather intense motivation to get things right before I move.

You should know that I’m an electronics engineer with 45+ years of experience, much of it managing others, and a good deal of it associated with writing software/firmware for medical devices. I know full well the impossibility of proving the negative, i.e., that there are no bugs, especially in systems as complex as LinuxCNC. I am also well familiar with the problem of having obstinate engineers who won’t investigate their own work to discover why things won’t work, even if the evidence points there. I’ve got about a 90% hit rate when I follow the evidence without prejudice. It’s been my experience that following the evidence is a much faster way of getting to and fixing problems than any other way, and right now time is of the essence. If my style has been abrasive because of my insistence on it, I hope you’ll understand why.

Of course, that leaves the other 10% where I’m wrong, and this could yet be one of those cases. I’m open to other troubleshooting suggestions and other potential failure modes if you’ll let me know what they are and provide justification as to how they can induce the current behavior. If there are none, I guess I’ll return the card, take the hit, and hope for the best.

23 Sep 2022 22:00 - 23 Sep 2022 22:03 #252637 by rodw
I've just read through this thread and 2 things come to mind.

1. Do your steppers have holding torque when your system is powered up with axis open?
2. The "error finishing read" you experienced.

If you don't have holding torque and can turn your motors by hand, please disconnect the enable wires on the DM542 driver itself and try again.

"Error finishing read" is a more sinister issue that sits at the Linux kernel or NIC driver level, outside of LinuxCNC. It is caused by excessive latency on the Ethernet NIC and has been more prevalent with later OS distributions. It happens when the Ethernet latency exceeds the servo thread period. Once an error finishing read occurs, the game is up and the Mesa card is fully disabled until LinuxCNC is restarted. It's mostly associated with Realtek NICs such as are no doubt used on your N3160 USFF PC. I've used similar PCs for a long time, and later Debian versions (10 and higher) are causing error finishing read on them now. Sharing the output from typing halcmd show param *tmax* would give some insight (I hope I got that command right). PCW has recently shared some GRUB command line tweaks, which I have yet to try, that may help.
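
Taking the description above at face value, the condition can be written as a simple check: a read is in trouble when the packet round trip does not complete within one servo period. This is only a rough sketch. It assumes a 1 ms servo period (a common default, not confirmed for this configuration), the function names are hypothetical, and the round-trip values are placeholders rather than measurements.

```python
def read_margin_us(round_trip_us, servo_period_ns=1_000_000):
    """Slack left in the servo period, in microseconds.

    A negative result means the reply arrived after the period ended.
    servo_period_ns defaults to 1 ms, which is an assumption.
    """
    return servo_period_ns / 1000.0 - round_trip_us


def late_reads(round_trips_us, servo_period_ns=1_000_000):
    """Return the round-trip times that overran the servo period."""
    return [
        rt for rt in round_trips_us
        if read_margin_us(rt, servo_period_ns) < 0
    ]


# Placeholder round-trip times in microseconds, not measured values.
print(late_reads([200.0, 950.0, 1200.0]))
```

If a list of real round-trip times from a capture comes back non-empty, that would be consistent with the latency explanation.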
Last edit: 23 Sep 2022 22:03 by rodw.
