Hello everyone,
This will most probably turn out to be a rather long write-up, and it is my first time posting here, so please bear with me. I would say I am a beginner with LinuxCNC, despite being quite the general GNU/Linux nerd, but I have done a lot of reading on the forum and in the docs over the past month.
I have "inherited", so to speak, the machine from the person who wrote this post. The backstory is in there if you need any more context.
In short, I am still stuck on the same problem my colleague couldn't resolve: the machine (servos, spindle and IO) behaves inconsistently, which I assume comes down to its EtherCAT connection. The first sign was that it was initially a coin flip whether the .ini would even launch. Most times it exited with an error about the Master PDOs not being correct, and dmesg would report some random pin (one that was defined in the xml) as not being present. Different LAN cables were tried, but that didn't help, so I was worried the IO was busted. I pushed on, though, and managed to drastically improve this particular problem. I discovered that the OS was trying to establish an Ethernet connection over the LAN cable, which to my understanding can definitely disturb EtherCAT communications. After I disabled that, the .ini launches reliably. It is still not perfect: sometimes after a reboot I need to restart the EtherCAT driver with ethercatctl restart and/or replug the cable, but after that there are no issues with this, at least until the next reboot.
Just as I thought I was out of the woods, I noticed that on some runs of a test program (not all, it is still a coin toss) the machine would emit uncomfortable noises as the axis moved. Not only that, but when I left it running for a few hours, I would usually find that LinuxCNC had reported an IO error (a different one each time) and the program had stopped. Interestingly, this is the same behavior I observe when I yank the LAN cable out while the machine is in operation. I see in the previous post that my colleague had the same issue, which you guys suggested was due to packet loss of some sort. And so it would seem, as ethercat masters -v does sometimes report lost frames: sometimes in the tens, sometimes in the hundreds, sometimes in the thousands.
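To pin down when the counter actually jumps, something like the following could log the increases with timestamps. This is only a rough sketch: it assumes the verbose output contains a line of the form "Lost frames: N" (the regex would need adjusting if the format differs), and the helper names parse_lost_frames, lost_frame_deltas and poll are made up for illustration.

```python
import re
import subprocess
import time

# Assumption: the verbose master output contains a line like "Lost frames: 123".
LOST_RE = re.compile(r"Lost frames:\s*(\d+)")


def parse_lost_frames(text):
    """Return the first 'Lost frames' counter found in text, or None."""
    match = LOST_RE.search(text)
    return int(match.group(1)) if match else None


def lost_frame_deltas(samples):
    """From (timestamp, counter) samples, return (timestamp, increase)
    for every step where the counter grew."""
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur > prev:
            deltas.append((ts, cur - prev))
    return deltas


def poll(command=("ethercat", "masters", "-v"), interval_s=1.0, count=60):
    """Run the command repeatedly and collect (timestamp, counter) samples."""
    samples = []
    for _ in range(count):
        out = subprocess.run(command, capture_output=True, text=True).stdout
        lost = parse_lost_frames(out)
        if lost is not None:
            samples.append((time.time(), lost))
        time.sleep(interval_s)
    return samples
```

Calling lost_frame_deltas(poll()) while the latency test is running would give a list of moments at which frames were lost, which could then be lined up against the latency spikes.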
The one suspicious thing I think I have spotted is in the latency test. It normally reports between 30k and 65k ns, but as soon as ethercat masters -v starts counting up lost frames, the latency increases two- or even threefold. The lost frames then stop increasing and normal behavior resumes. Since I couldn't find a great reference, do you believe these latency numbers are to be expected?
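To put the latency figures in perspective, here is a quick back-of-the-envelope calculation of how much of the servo period they eat up. The 1 ms period is an assumption used as a placeholder; the actual SERVO_PERIOD from the .ini should be substituted.

```python
def jitter_fraction(jitter_ns, period_ns):
    """Fraction of one thread period consumed by the given latency."""
    return jitter_ns / period_ns


# Assumption: 1 ms servo period; replace with the real value from the .ini.
SERVO_PERIOD_NS = 1_000_000

# Low and high ends of the observed range, plus the threefold spike.
for label, jitter_ns in [
    ("normal, low", 30_000),
    ("normal, high", 65_000),
    ("spike, 3x high", 3 * 65_000),
]:
    share = 100 * jitter_fraction(jitter_ns, SERVO_PERIOD_NS)
    print(f"{label}: {jitter_ns} ns is {share:.1f}% of the period")
```

Under that assumption the normal range is 3.0% to 6.5% of the period, and a threefold spike on the high end reaches 19.5%.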
I am quite stumped on what to do. I tried monitoring for lost packets with Wireshark on the Master PC, but I saw no odd behavior there.
The irq_coalescence suggestion by rodw didn't seem to help, and setting refClockSyncCycles to -1 as suggested by db1981 prevented the .ini from launching at all. I do admit, though, that I didn't quite understand what setting it to -1 is supposed to achieve. I see that doing this requires applying some sort of patch, but since that patch was for lcnc 2.8, I am not sure whether I should try it on the current version, 2.9.3. How would I go about installing it when my EC installation comes from the repositories instead of from source?
Do you guys have any ideas on what I could attempt? My plan for the short term is to use a laptop as a Wireshark middleman to better monitor for packet loss, following the PDF I will leave in the attachments (sadly, I have no source for it). If that doesn't yield anything, I will probably replace the IO module with some other EtherCAT IO board and redo the mappings to see if it is indeed the IO after all, while triple-checking for faulty wiring as I go.
I am also attaching my .xml, .hal and .ini files just in case.
Any help or suggestions are much appreciated. Do let me know if there is anything I've missed that I should specify. Thank you!
PS. I realize I should probably mention that the onboard PC is running off a SATA SSD onto which I cloned my LinuxCNC virtual machine test bench. From what I have seen, there do not seem to be any driver discrepancies or errors, but I thought I would point it out as it is yet another quirk in my setup.