EtherCAT Communication Issues, Lost Frames, Interruptions in Program Execution

29 Aug 2024 10:41 #308924 by stenly
Hello everyone,

This will most probably turn out to be a rather long write-up, and it is my first time posting here, so please bear with me. I would say I am a beginner with LinuxCNC, despite being quite the general GNU/Linux nerd, but I have done a lot of reading on the forum and in the docs over the past month.

I have "inherited", so to speak, the machine from the person who wrote this post. The backstory is in there if you need any more context.

In short, I am still stuck at the same problem the colleague before me couldn't resolve, namely the machine (servos, spindle and IO) being inconsistent with its EtherCAT connection (I assume). The first sign was that initially it was basically a coin flip as to whether the .ini configuration would even launch. Most times it exited with an error about the Master PDOs not being correct, and dmesg would report some random pin (that was defined in the xml) not being present. Different LAN cables were tried, but that didn't help. I was initially worried the IO was busted, but I pushed on and managed to drastically improve this particular problem: I discovered that the OS was trying to establish an Ethernet connection over the LAN cable, which to my understanding can definitely disturb EtherCAT communications. Disabling that allowed me to launch the .ini reliably. It is still not perfect, since sometimes after a reboot I need to restart the EtherCAT driver with ethercatctl restart and/or replug the cable, but after that there are no issues with this until the next reboot at least.

Just as I thought I was out of the woods, I noticed that on some runs of a test program (not all, still a coin toss) the machine would emit uncomfortable noises as the axis moved. Not only that, but when I left it running for a few hours, I would usually find the program stopped and LinuxCNC reporting an IO error, a different one each time (interestingly, this is the same behavior I observe when I yank the LAN cable out during operation). I see in the previous post that my colleague had the same issue, which you guys suggested was due to packet loss of some sort. And so it would seem, as ethercat masters -v does sometimes report lost frames: sometimes in the tens, sometimes in the hundreds, sometimes in the thousands.

The one suspicious thing I think I have spotted is in the latency test. It normally reports between 30k and 65k ns, but as ethercat masters -v starts counting up lost frames, the latency increases two- or even threefold. The lost frames then stop increasing and normal behavior resumes. Since I couldn't find a good reference, do you believe these latency numbers are to be expected?
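
For a rough sense of scale, here is a minimal sketch that expresses the latency figures above as a share of one servo period. The 1 ms period is an assumption (the actual value lives in the attached .ini/.hal files and is not quoted in this post), and `jitter_share` is just an illustrative helper name.

```python
# Assumption: a 1 ms servo thread period (1_000_000 ns). The real value is set
# in the .ini/.hal files, which are not reproduced in this post.
SERVO_PERIOD_NS = 1_000_000


def jitter_share(latency_ns: int, period_ns: int = SERVO_PERIOD_NS) -> float:
    """Return the latency as a percentage of one servo period."""
    return 100.0 * latency_ns / period_ns


# Figures quoted above: 30k-65k ns normally, two to three times that in a spike.
samples = [
    ("normal low", 30_000),
    ("normal high", 65_000),
    ("spike x2", 2 * 65_000),
    ("spike x3", 3 * 65_000),
]
for label, ns in samples:
    print(f"{label:12s} {ns:>8d} ns = {jitter_share(ns):5.1f} % of the period")
```

Under that assumption the normal readings use a few percent of the period, while a threefold spike approaches a fifth of it. With a different servo period the percentages scale accordingly.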

I am quite stumped on what to do. I tried monitoring for lost packets with Wireshark on the Master PC, but there was no odd behavior there. The irq_coalescence suggestion by rodw didn't seem to help, and setting refClockSyncCycles to -1 as suggested by db1981 prevented the .ini from launching at all. I do admit I didn't quite understand what setting it to -1 is supposed to achieve. I see that in order to do this you would need to apply some sort of patch, but since that was for lcnc 2.8 I am not sure if I should try it on the current version 2.9.3. How would I go about installing it when my EC installation is from the repositories instead of from source?

Do you guys have any ideas on what I could attempt? My plan for the short term is to use a laptop as a Wireshark middleman to better monitor for packet loss, following the PDF I will leave in the attachments (I have no source for it, sadly). If that doesn't yield anything, I will probably replace the IO module with some other EtherCAT IO board and redo the mappings to see if it is indeed the IO after all, while triple-checking for faulty wiring as I go.

I am also attaching my .xml, .hal and .ini files just in case.

Any help or suggestions are very appreciated. Do let me know if I can specify something I've missed in this post. Thank you!

PS. I realize I should probably mention that the onboard PC is running off a SATA SSD to which I cloned my LinuxCNC virtual machine test bench. From what I have seen there do not seem to be any driver discrepancies or errors, but I thought I would point it out as it is yet another quirk in my setup.

29 Aug 2024 12:34 #308927 by tommylight
Not EtherCAT-specific, but general networking:
-check cables one by one
-test only a single drive at a time
-for Wireshark you would need an old hub, not a switch, to be able to see all the traffic, or use the laptop as a proxy between the control PC and the EtherCAT equipment
-Wireshark needs a network card that can do "promiscuous" mode to monitor everything; some cards cannot do that despite working with Wireshark and showing some traffic.

29 Aug 2024 12:40 #308928 by rodw
Is it correct that the PIDs for all drives are 0?
What does ethercat slaves -v say?
Can you confirm that there is now a dedicated ethernet connection for ethercat?

29 Aug 2024 13:58 - 29 Aug 2024 13:59 #308933 by stenly
Hm, you are correct, it is weird how they are all zeroes, except the 6th slave which is a 1. But this is indeed the output:
ethercat slaves -v | grep Product
Product code: 0x00087611
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000001

(I have attached the full output as a separate file so as not to have a wall of text)
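
A minimal sketch for tabulating the output above, so it is easy to see at a glance which slave positions share a product code when comparing against the manufacturer xml. The text in `OUTPUT` is copied from this post; the helper name `product_codes` is only illustrative, and nothing here says whether shared codes are actually a problem for LinuxCNC.

```python
from collections import Counter

# Output of `ethercat slaves -v | grep Product`, copied from this post.
OUTPUT = """\
Product code: 0x00087611
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000000
Product code: 0x00000001
"""


def product_codes(text: str) -> list[int]:
    """Extract the product codes, in slave order, from the grep output."""
    codes = []
    for line in text.splitlines():
        label, _, value = line.partition(":")
        if label.strip() == "Product code":
            codes.append(int(value.strip(), 16))
    return codes


codes = product_codes(OUTPUT)
counts = Counter(codes)
for position, code in enumerate(codes):
    note = "shared with other slaves" if counts[code] > 1 else "unique"
    print(f"slave {position}: 0x{code:08x} ({note})")
```

Positions are counted from 0 here, following the order of the lines in the output.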

What does this suggest in your opinion? I do have the xml from the manufacturer on me (attached as well) and it does look like these are the intended PIDs even there. Could this weird decision from the manufacturer be causing some conflict with LinuxCNC?

Also regarding your latter question, I do believe that is the case, but I am not sure if there is any better way to confirm this than simply removing all wired connections from the XFCE internet settings. I even disabled the other ethernet port of the PC. This also stopped the constant notifications that the network was disconnected. Do you have anything else in mind I can try to be 100% certain?
Last edit: 29 Aug 2024 13:59 by stenly. Reason: formatting fix

29 Aug 2024 14:01 - 29 Aug 2024 14:02 #308934 by stenly
Thank you for your suggestions as well, Tommy. Sadly, there is a planned power outage tomorrow, so I will be able to put them into practice on Monday at the earliest... I'll post an update immediately afterwards.
Last edit: 29 Aug 2024 14:02 by stenly.

03 Sep 2024 05:48 #309262 by stenly
After more trial and error and checking the cables and slaves one by one, I am going to go with the working theory that the NIC is problematic in some way, as the biggest improvement yet was after I deleted the Ethernet connection it was trying to establish.

The onboard PC has two Intel I211 interfaces. I found this post and thought to try out the suggestions. I wasn't sure if I had the dkms driver for my interface, but lspci -v reported the driver as "igb". After some googling, this GitHub repo came up. It seems there is (or was?) a known issue with I210/I211 NICs that causes the driver to terminate due to a checksum mismatch. I am not sure this applies to me exactly: if the driver does terminate in my case, it does not stay down until the next reboot but drops out only momentarily, just long enough to spike the latency and desync the process. Either way, I wasn't able to install the patch from the repo, as it is for an old kernel and sudo apt install fails because of that. Do you suppose I should pursue this further, or has Intel resolved this driver issue in the past years?

The next thing I did was to add all of these kernel parameters:
efi=runtime igb.EEE=0 pcie_aspm=off pcie_port_pm=off pcie_aspm.policy=performance loglevel=3
I also disabled EEE from ethtool.
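
Since a typo in the bootloader configuration is easy to miss, here is a minimal sketch that compares the parameters listed above with the command line the running kernel actually received. It only checks for presence in /proc/cmdline; it does not verify that the kernel or the igb driver honors each option. `missing_params` is an illustrative helper name.

```python
# Parameters listed in this post.
EXPECTED = (
    "efi=runtime igb.EEE=0 pcie_aspm=off pcie_port_pm=off "
    "pcie_aspm.policy=performance loglevel=3"
).split()


def missing_params(cmdline: str, expected=EXPECTED) -> list[str]:
    """Return the expected parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:  # Linux only
            current = handle.read()
    except OSError:
        current = ""
    print("missing:", missing_params(current) or "none")
```

An empty result means every listed parameter reached the kernel as typed.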

I believe this improved things yet again, since the situation is now as follows. Running a test program with a max velocity of 1500 mm/min, there are no issues to report; it has been running for 4 hours already. However, upon increasing the velocity (the max is 6000 mm/min), the program fails in about 30 minutes and reports a joint error, if I don't stop it before that because of the worrying noises the axes emit.
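
One possible reading of why the higher velocity fails, sketched as arithmetic: the faster the feed, the further the commanded position moves in each servo cycle, so a single late or lost cycle corresponds to a larger position discrepancy. The 1 ms servo period is an assumption, and whether this is really what trips the joint error here is not established.

```python
# Assumption: 1 ms servo period, not taken from the attached files.
SERVO_PERIOD_S = 0.001


def travel_per_cycle_mm(feed_mm_per_min: float, period_s: float = SERVO_PERIOD_S) -> float:
    """Distance the commanded position moves during one servo period."""
    return feed_mm_per_min / 60.0 * period_s


for feed in (1500, 6000):
    print(f"{feed} mm/min -> {travel_per_cycle_mm(feed):.3f} mm per cycle")
```

Under that assumption each cycle covers four times as much travel at 6000 mm/min as at 1500 mm/min, so the same hiccup would produce four times the deviation.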

Due to this, I seriously suspect the NIC in general and the latency as the culprits. For the time being I will try out the suggestions in the Arch Wiki for realtime process management. I would nevertheless still be very grateful for any other help and ideas.

12 Sep 2024 05:49 #309968 by stenly
Hi again.

I spent quite a bit of time on this in the past week and I believe I narrowed down the problem. My worry is that I don't know if it can even be solved, haha.

Firstly, I dug deeper into what rodw suggested about making sure I had a NIC dedicated to EtherCAT. It turns out the generic EC driver doesn't reserve one for itself. I found out there is an igb driver for the 6.1 kernel, which was great. I had to remove everything EC-related and rebuild it from source, however, since the configure script needs the --enable-igb option passed to it for igb support. You can't do that when you install from the repositories, AFAIK. Either way, the NIC now doesn't even show up under "ip a" anymore. EtherCAT has it completely reserved for itself, which is great.
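
The "ip a" observation can also be checked from a script. This is a minimal sketch that looks at /sys/class/net, where Linux lists the interfaces known to the normal network stack. The interface name enp1s0 is hypothetical, and an absent name only shows the OS does not see the interface; it does not by itself prove the EtherCAT master has claimed it.

```python
import os


def visible_interfaces(sysfs_net: str = "/sys/class/net") -> list[str]:
    """List the network interfaces the kernel's network stack currently exposes."""
    try:
        return sorted(os.listdir(sysfs_net))
    except FileNotFoundError:
        return []


def is_hidden(name: str, sysfs_net: str = "/sys/class/net") -> bool:
    """True if `name` is absent, i.e. not visible to the normal network stack."""
    return name not in visible_interfaces(sysfs_net)


# "enp1s0" is a hypothetical interface name; substitute the real one.
print("visible interfaces:", visible_interfaces())
print("enp1s0 hidden from the OS:", is_hidden("enp1s0"))
```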

Now, the real culprit. As I mentioned before, I saw some blink-and-you'll-miss-them spikes in the latency. After a lot of reading, I came to the conclusion that they are related to Intel SMI (System Management Interrupt). Digging into some older documentation, I saw it mentioned as a culprit. However, these posts are very old at this point. The tool suggested at the end of that wiki article did not help me, as it only reports "No SMI-enabled chipset found".

This cannot be the case, though, as my BIOS clearly has a setting called "Global SMI Lock". Additionally, I found a utility called BITS. It provides exactly what I need: an SMI latency test. The PC failed said test miserably, with 300 μs being reported. That is, by the way, the exact size of the spike the LinuxCNC Latency Test reported. I am attaching a photo of the test result. I am assuming the tool from the wiki article is simply very old and does not support the onboard PC's chipset (the CPU is an Intel Celeron N2930). The MSR 0x34 register as reported by turbostat is also 1.17, which I am not sure how to interpret exactly, but seeing as it is well over 0, it must not be good.
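
To see whether SMIs are actually firing while the machine runs, the counter in MSR 0x34 can be sampled twice and the difference taken. This is a minimal sketch, assuming the CPU implements that counter, the `msr` kernel module is loaded (modprobe msr) and the script runs as root; the function names are illustrative.

```python
import os
import struct
import time

MSR_SMI_COUNT = 0x34  # the register mentioned above


def read_smi_count(cpu: int = 0, path=None) -> int:
    """Read the SMI counter (low 32 bits of MSR 0x34) for one CPU.

    On a real system this needs root and the `msr` kernel module.
    """
    path = path or f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, MSR_SMI_COUNT)  # the MSR number is the file offset
    finally:
        os.close(fd)
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def smi_delta(seconds: float, cpu: int = 0, path=None) -> int:
    """Number of SMIs counted on one CPU during `seconds`."""
    before = read_smi_count(cpu, path)
    time.sleep(seconds)
    return read_smi_count(cpu, path) - before


if __name__ == "__main__":
    try:
        print("SMIs in 5 s:", smi_delta(5.0))
    except OSError as exc:
        print("could not read the MSR:", exc)
```

A delta that stays at 0 while the latency spikes keep appearing would point away from SMIs; a delta that rises in step with the spikes would support the theory.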

I will spend the day trying to somehow disable SMI and testing the machine with a different PC that I know passes the BITS SMI test.

Does anyone have any idea on how to go about disabling SMI? I believe the only option in my BIOS is this Global SMI Lock I mentioned which didn't seem to do much. On some old forums I saw people mentioning you need to mod and reflash your BIOS. I'd like to leave that as a last resort as doing it improperly will essentially deep fry the motherboard.

12 Sep 2024 11:52 #309994 by tommylight
Try disabling the following in the BIOS: hyperthreading, turbo, speedstep, TPM, any power saving option, PCIe aggressive link management or similar, modem, audio, serial ports, etc.
Also helpful: set a fixed value for shared memory, do not leave it on auto. I keep forgetting this one, as it causes latency spikes whenever it adjusts the amount of memory dedicated to integrated graphics.

12 Sep 2024 13:58 #310007 by stenly
Thank you for the response, Tommy.

I had actually tried all of those before. I hadn't looked into adjusting the shared memory, so I tried that now, but there doesn't seem to be an option for it at all. It seems to be a pretty locked-down BIOS...

I did more reading, and it seems that switching to the RTAI kernel is a solution for some. I figure it's worth trying before I look into buying a new onboard PC entirely. I tried to install it on a fresh Debian Bookworm install as suggested in the docs and followed the instructions from the README.INSTALL. However, I am still not able to boot properly into the newly created GRUB kernel entry. There is an error about the root device not being able to be mounted, even though the regular Debian install still boots with no issues. I'm attaching a photo of the error as well.

Is there some sort of better guide on how to install the RTAI kernel and LinuxCNC on it?

12 Sep 2024 14:46 #310010 by tommylight
The error means GRUB is OK and working, but the installation was done on a drive that no longer exists, probably a USB device.
For RTAI see:
linuxcnc.org/docs/2.9/html/getting-start...#cha:Installing-RTAI
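
A minimal sketch for checking the "drive that no longer exists" explanation from the working Debian install: take the kernel line of the failing GRUB entry and test whether its root=UUID=... value matches a filesystem that is currently present under /dev/disk/by-uuid. The example kernel line and its UUID are made up, and only the UUID form of root= is handled.

```python
import os


def root_spec(cmdline: str) -> str:
    """Return the value of the root= parameter, or '' if there is none."""
    for token in cmdline.split():
        if token.startswith("root="):
            return token[len("root="):]
    return ""


def root_uuid_exists(cmdline: str, by_uuid_dir: str = "/dev/disk/by-uuid") -> bool:
    """True if root=UUID=... names a filesystem that is currently present."""
    spec = root_spec(cmdline)
    if not spec.startswith("UUID="):
        return False  # only the UUID form is handled in this sketch
    return os.path.exists(os.path.join(by_uuid_dir, spec[len("UUID="):]))


# Hypothetical kernel line as it might appear in a GRUB entry; the UUID is made up.
example = "BOOT_IMAGE=/boot/vmlinuz root=UUID=0000aaaa-1111-2222-3333-444455556666 ro quiet"
print(root_spec(example), root_uuid_exists(example))
```

If the check returns False for the RTAI entry's real kernel line, that entry points at a filesystem the machine cannot currently see.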
