Ethernet latency issue with 2.9.2 and 7i95T

20 Mar 2024 15:26 #296390 by Finngineering
Hello,

I'm currently "restoring" a Maho MH600T to use LinuxCNC, with a Mesa 7i95T as the main interface card:
forum.linuxcnc.org/30-cnc-machines/49758...mizing-a-maho-mh600t
Unfortunately, I have run into issues with ethernet latency, and get "unexpected realtime delay" and "watchdog has bit" errors. Monitoring hm2_7i95.0.read.time shows some massive spikes. See the screenshot below and the console output from starting LinuxCNC. The highest spike is around 20 000 000, while the base level is around 300 000.

The computer is an HP Z240 SFF Workstation (J9C02ET#ABU) with an i7-6700 CPU. I have done quite a bit of reading about other similar issues. The computer has a built-in I219-LM ethernet adapter, which I read somebody else had issues with. So I bought a PCI-e TP-Link TG-3468, which has the Realtek RTL 8168 chipset. But I get the same issues with both cards. Later I read that there are also issues with the Realtek chipsets, and it's not easy to find anything other than Intel or Realtek.

I have disabled all "performance" and power management features I can find in the BIOS, and modified the kernel options to:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.6.7-rt18-linuxcnc root=UUID=a2994d0f-a7e5-42a0-a350-8616556fa365 ro quiet idle=poll intel_idle.max_cstate=1 processor.max_cstate=1 i915.enable_dc=0
Still, no better.

As you might see from the kernel boot options above, I also compiled a custom kernel (actually several), as I read that the issue was with the stock debian kernel. It didn't help me. I followed the instructions here:
forum.linuxcnc.org/10-advanced-configura...6-7-improved-latency

Following some troubleshooting advice by PCW, I also did some ping tests:
$ ping -c 5 -i .2 10.10.10.10
PING 10.10.10.10 (10.10.10.10) 56(84) bytes of data.
64 bytes from 10.10.10.10: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=64 time=0.061 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=64 time=0.067 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=64 time=0.061 ms
64 bytes from 10.10.10.10: icmp_seq=5 ttl=64 time=0.060 ms

--- 10.10.10.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 814ms
rtt min/avg/max/mdev = 0.060/0.070/0.101/0.015 ms
and for a couple of minutes:
$ sudo chrt 99 ping -i .001 -q 10.10.10.10
PING 10.10.10.10 (10.10.10.10) 56(84) bytes of data.

--- 10.10.10.10 ping statistics ---
98834 packets transmitted, 98833 received, 0.0010118% packet loss, time 99322ms
rtt min/avg/max/mdev = 0.049/0.068/9.884/0.133 ms
With a maximum response time of almost 10 ms, no wonder there are some issues.
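As a rough sketch of how the summary line above could be checked automatically, the following parses the "rtt min/avg/max/mdev" line and compares the worst round trip against a servo period. The function name parse_rtt_summary and the 1 ms servo period are assumptions for illustration; the servo period is not stated in this thread.

```python
import re

def parse_rtt_summary(line: str) -> dict:
    """Extract min/avg/max/mdev (in ms) from a Linux ping summary line."""
    m = re.search(
        r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", line
    )
    if m is None:
        raise ValueError("not a ping rtt summary line")
    keys = ("min", "avg", "max", "mdev")
    return dict(zip(keys, map(float, m.groups())))

# Summary line copied from the long ping run above.
summary = parse_rtt_summary("rtt min/avg/max/mdev = 0.049/0.068/9.884/0.133 ms")

SERVO_PERIOD_MS = 1.0  # assumption: 1 ms servo thread, not stated in the thread

if summary["max"] > SERVO_PERIOD_MS:
    print(f"worst round trip {summary['max']} ms exceeds the servo period")
else:
    print(f"worst round trip {summary['max']} ms fits within the servo period")
```

The average here looks fine; it is only the maximum that reveals the problem, which is why the long run with many packets is the more useful test.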

However, the "normal" system latency does not appear bad. Here is a latency-histogram taken while surfing the web, writing this post, playing YouTube and running 5 glxgears:

I actually used this same computer on 2.8.4 earlier without many issues. However, I do think I got the same errors every once in a while, maybe after an hour or so. But that was with no tuning of BIOS or kernel options, and with the Intel network card without changing the IRQ coalescing options (if that was at all relevant at that time).

I am starting to run out of ideas. But maybe somebody here has some advice?

20 Mar 2024 15:43 #296391 by PCW
That is a huge latency spike (almost 6 ms!)

Have you disabled all power management options/turbo modes/hyperthreading in the BIOS?

 

20 Mar 2024 16:13 #296397 by Finngineering
Yes, power management, turbo and hyperthreading disabled in BIOS. I can take a photo later for reference.

However, just after I wrote the post, it occurred to me that for the sake of completeness, I should remove the USB/ethernet dongle I use for internet access. At the same time, I also removed a wireless ethernet card I had installed previously. Lo and behold, after this I have not seen the errors, although it has only been half an hour or so. I plugged the USB/ethernet dongle back in to write this message, and it's still not complaining.

I will monitor for a few hours and write a new post with the results. And I will figure out if it was the wireless ethernet card (TP-Link TL-WN781ND) or the USB/ethernet dongle that was causing the issues. And I will check if I can go back to the stock kernel as well. I will report back once I have some more information to share.

By the way, what is the formula to convert tmax to milliseconds (if indeed there is such a thing)?

20 Mar 2024 16:25 - 20 Mar 2024 16:25 #296399 by PCW
Non-USB wireless cards are very often implicated in poor network latency
(I think it is because actual BIOS add-on card code is run directly from ROMs on these cards)

tmax values (on x86) are in clock cycles, so time in seconds = tmax_value/cpu_clock_frequency
(tmax values on ARM are in ns)
Last edit: 20 Mar 2024 16:25 by PCW.

20 Mar 2024 20:49 #296408 by Finngineering
The computer has now been running for ~2.5 hours without issue, and the hm2_7i95.0.read.tmax was around 900 000. I then turned off the computer and reinstalled the wireless card. And again I get the error messages almost immediately upon starting LinuxCNC. So it's quite clear to me that the wireless card was causing the issues, even though it was not connected to a network (it was never even configured to connect to one). Anyway, I never planned to use that card other than for installation/setup, so no problem for me.

I will see what more I can determine. And if the stock kernel will work. But it may be a couple of days before I get the chance to do that.

Thanks for the tmax explanation. So in my case with "baseline" around 300 000, it's just shy of 0.1 ms with 3.4 GHz clock frequency.
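For anyone wanting to repeat the arithmetic, here is a minimal sketch of the conversion described above (time = tmax / CPU clock frequency, valid for x86 where tmax is in clock cycles). The function name tmax_to_ms is made up for illustration, and 3.4 GHz is the nominal clock assumed in this post; the effective counter frequency on a given machine may differ.

```python
def tmax_to_ms(tmax_cycles: int, cpu_hz: float) -> float:
    """Convert an x86 tmax reading (CPU clock cycles) to milliseconds."""
    return tmax_cycles / cpu_hz * 1000.0

CPU_HZ = 3.4e9  # assumption: nominal 3.4 GHz clock, as used in the post

# tmax readings quoted in this thread
readings = [
    ("baseline", 300_000),
    ("after removing the wireless card", 900_000),
    ("worst spike", 20_000_000),
]

for label, cycles in readings:
    print(f"{label}: {tmax_to_ms(cycles, CPU_HZ):.3f} ms")
```

With these inputs the baseline comes out just under 0.1 ms and the worst spike just under 6 ms, which matches the figures mentioned earlier in the thread.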

BIOS photos also attached, but probably of no interest as of now.

20 Mar 2024 21:01 #296410 by tommylight
Try leaving multi processor on, otherwise the PC might be very sluggish.
It has never caused issues for me.

20 Mar 2024 21:15 #296413 by PCW
Yes, use all cores. Worst case, you can isolate the last core for LinuxCNC use with isolcpus.

20 Mar 2024 21:27 #296414 by Finngineering
I turned off almost all "extra" features in BIOS, but maybe went one step too far with that one. (The computer is not sluggish at all, though.)

I will re-enable it, and may go a bit deeper with isolcpus and the IRQ affinity/routing to see how that works out:
forum.linuxcnc.org/38-general-linuxcnc-q...ead-and-irq-affinity
I just saw that you, PCW, mentioned that thread in another reply. That is one thread I had not come across before (even if it's marked as sticky).

25 Mar 2024 15:54 #296732 by Finngineering
Today I did some more tests, and below are the outcomes.

I enabled multiprocessor in the BIOS settings. Without isolcpus and without IRQ rerouting, it did not work great. But after adding isolcpus and IRQ rerouting, the tmax values seem to top out around 1 000 000 (with 10 glxgears and YouTube playing). Results were slightly better with the Realtek card than with the integrated Intel chipset, so I will use the Realtek for Mesa communication.
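A small sketch for confirming that an isolcpus option actually reached the running kernel: it splits a kernel command line (the same text that `cat /proc/cmdline` prints) into options and looks for isolcpus. The helper name parse_cmdline is made up, and the `isolcpus=3` value appended below is purely hypothetical (the last core of a 4-core CPU); the exact value used on this machine is not given in the thread.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {option: value}; bare flags map to None."""
    opts = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")  # split on the first '=' only
        opts[key] = value if sep else None
    return opts

# Command line from the first post, plus a HYPOTHETICAL isolcpus=3 for illustration.
sample = (
    "BOOT_IMAGE=/boot/vmlinuz-6.6.7-rt18-linuxcnc "
    "root=UUID=a2994d0f-a7e5-42a0-a350-8616556fa365 ro quiet idle=poll "
    "intel_idle.max_cstate=1 processor.max_cstate=1 i915.enable_dc=0 isolcpus=3"
)

opts = parse_cmdline(sample)
isolated = opts.get("isolcpus")
if isolated is None:
    print("isolcpus is not set")
else:
    print(f"isolated CPUs: {isolated}")
```

On a real system, the sample string would be replaced by the contents of /proc/cmdline.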

Changing back to the stock kernel caused no issues.

So, the problem is solved. And in the end it turned out to be only the unused WiFi card that I had plugged in. Ugh. I already had that card installed in the computer with LinuxCNC 2.8.4, and it did not cause any big issues then, at least.
