Version 2.8.2 Debian Buster PREEMPT-RT ISO Freezes

27 Jun 2022 04:40 #245981 by Pro_El
Maybe you have the same problem as me. I have a PC with 1 GB of RAM, and it freezes after using LinuxCNC for a while.
I watched the RAM consumption, and it slowly crept up over time.
Can you confirm that you have enough RAM?
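
One way to check whether memory use really creeps up is to log the kernel's own figures from /proc/meminfo at intervals. Below is a minimal Python sketch, assuming a Linux system; the function names and the 60-second default interval are arbitrary choices for illustration, not part of LinuxCNC.

```python
import time
from typing import Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def log_available_memory(samples: int, interval_s: float = 60.0) -> List[int]:
    """Print and return MemAvailable (kB) `samples` times, `interval_s` apart."""
    readings = []
    for i in range(samples):
        with open("/proc/meminfo") as f:
            available_kb = parse_meminfo(f.read())["MemAvailable"]
        readings.append(available_kb)
        print(f"{time.strftime('%H:%M:%S')}  MemAvailable: {available_kb // 1024} MiB")
        if i < samples - 1:  # no need to sleep after the last sample
            time.sleep(interval_s)
    return readings
```

For example, log_available_memory(samples=60) logs for about an hour. A MemAvailable figure that keeps falling while LinuxCNC sits idle would support the leak theory; a flat line would argue against it.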

27 Jun 2022 11:20 #245990 by jcbryant
At the moment I have 2 GB of RAM. I had 4 GB, but took out half of it during testing to see whether the freezes were caused by a memory problem.

It seems rather unlikely that LinuxCNC has some form of memory leak. If it did everybody would ultimately have their systems freeze.

I'm coming to think that the problem must be related to Ethernet communication with the Mesa card. When LinuxCNC is just sitting there, it can't be doing much other than periodically exchanging packets with the card: a timer goes off, LinuxCNC receives and sends packets, and then it goes back to sleep. Perhaps the communication involves polling with interrupts disabled, and somehow the awaited event never occurs? That would explain how the whole system freezes.
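
The distinction being drawn here, between a wait that can give up and one that cannot, can be shown with a toy sketch. This is a generic Python illustration of polling with a deadline, not the hm2_eth driver's code; the function poll_until is hypothetical, and `ready` stands in for whatever "reply packet has arrived" check a real driver would make.

```python
import time
from typing import Callable


def poll_until(ready: Callable[[], bool], timeout_s: float) -> bool:
    """Busy-poll `ready()` until it returns True or `timeout_s` elapses.

    Returns True if the event happened and False on timeout. Without the
    deadline check this loop would spin forever if the event never came,
    which is the kind of hang speculated about above.
    """
    deadline = time.monotonic() + timeout_s
    while not ready():
        if time.monotonic() >= deadline:
            return False
    return True
```

A driver written like this reports an error and carries on when the card does not answer in time; one that omits the deadline would simply stop responding.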

27 Jun 2022 12:32 #245997 by tommylight
There are a lot of reasons for a PC to freeze during use. With LinuxCNC the main ones are the processor overheating, the processor's voltage regulators overheating or being old and on the way out, failing memory modules or memory controller, a weak power supply due to dried-out capacitors, the southbridge/northbridge overheating, crappy Intel engineering for the last 20-odd years, etc.
Even dust in the PCI / PCI-E slots can cause strange behaviour.
Also, RT kernels have a knack for refusing to work with some hardware, mostly because that hardware has a crappy real-time implementation. Remember the Intel i915 chipset? Those issues are still present in new chipsets; the software was made to work around them.
Not much help, but it gives an idea of what to check/change. Or try another PC.
The reason weak hardware shits the bed when running LinuxCNC is that LinuxCNC uses the RT kernel to disable idle states, set the maximum frequency, keep the processor ready at all times, etc. You can see this if you monitor the processor temperature and frequency: the temperature skyrockets as soon as LinuxCNC is started. Hence the first thing to check is cooling.
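
For monitoring temperature and clock speed without extra tools, here is a minimal Python sketch, assuming a Linux system that exposes the standard sysfs thermal and cpufreq interfaces. Which zones and files exist varies by hardware and driver, and the function names are illustrative only. Taking readings before and after starting LinuxCNC would show whether the jump described above happens on a given machine.

```python
import glob
import os
from typing import Dict


def millidegrees_to_c(raw: str) -> float:
    """sysfs thermal zones report temperature in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0


def khz_to_mhz(raw: str) -> float:
    """cpufreq's scaling_cur_freq is reported in kHz."""
    return int(raw.strip()) / 1000.0


def _read(path: str) -> str:
    with open(path) as f:
        return f.read()


def thermal_zones() -> Dict[str, float]:
    """Map each readable thermal zone (with its type) to a temperature in C."""
    temps = {}
    for zone in sorted(glob.glob("/sys/class/thermal/thermal_zone*")):
        try:
            kind = _read(os.path.join(zone, "type")).strip()
            temp_c = millidegrees_to_c(_read(os.path.join(zone, "temp")))
        except (OSError, ValueError):
            continue  # some zones cannot be read or are not real sensors
        temps[f"{os.path.basename(zone)} ({kind})"] = temp_c
    return temps


def cpu_frequencies() -> Dict[str, float]:
    """Current frequency of each CPU in MHz, where a cpufreq driver is loaded."""
    freqs = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    for path in sorted(glob.glob(pattern)):
        cpu = path.split("/")[5]  # e.g. "cpu0"
        try:
            freqs[cpu] = khz_to_mhz(_read(path))
        except (OSError, ValueError):
            continue
    return freqs
```

Calling thermal_zones() and cpu_frequencies() every few seconds, for example from a loop in a separate terminal, gives a simple log. An empty result from cpu_frequencies() just means no cpufreq driver is active.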

27 Jun 2022 12:55 #246003 by jcbryant
I did try monitoring the CPU temperature, and it never goes above 68C, even with LinuxCNC running.

I've just been playing around with things, and think that I may be getting somewhere.

Running a browser at the same time as LinuxCNC causes things to fall apart pretty dramatically. I got a 7i76e error message ("DoIt not cleared from previous thread") and found myself in a situation where LinuxCNC was responding, but any attempt to move any of the axes produced no movement and a "Joint n following error". There was also that immediate failure last night when, because I'd been looking at the forum, I happened to have a browser running.

At the moment I have LinuxCNC running with the Internet Ethernet cable disconnected. I've had a latency error message and a repeat of the 7i76e message (perhaps kernel 5.10 has made things worse in this respect; I've never seen this message before), but at least the system has run for an unusually long time without freezing.

At the moment I have the Internet connected to LAN connector #1 (the "default", as reported by the network manager) and the Mesa card connected to LAN connector #2. Perhaps connector #1 has some form of priority, so that any activity on it interferes with activity on connector #2, and I should try swapping things around?

27 Jun 2022 23:09 #246030 by jcbryant
It seems that I have at least managed to convert my problem into something more definitive.  LinuxCNC now survives for no more than a few seconds before running into a very definite and repeatable error.

The machine has two LAN ports: enpls0 ("wired connection 1") and ens1 ("wired connection 2"). Originally the internet was connected to enpls0, the Mesa card was connected to ens1, and the network manager showed enpls0 as the "default".

I've now swapped things around. The Mesa card is now connected to enpls0, the internet to ens1, and at some point in the process the network manager began showing ens1 as the "default".

I can ping the Mesa card. The very first round trip always seems to take a tad longer, but after that round trip times look pretty reasonable (e.g: min 0.099ms, avg 0.111ms, max 0.124 ms). And I can use mesaflash to extract information from the card. So Mesa card communications would appear to be working properly.

And I'm able to browse the Internet.

But when I start up LinuxCNC and try to do anything, I almost immediately get

hm2/hm2.7i76e.0: error finishing read, iter = 58860

The iter value is typically somewhat smaller.

Once this has happened any attempt to move any axis produces "joint n following error" (where "n" is the axis number) and no movement at all.

And just to top things off, I am now seeing the latency warning message on a fairly regular basis. When I run the latency test, the jitter for both threads is about 50000 ns (50 µs). Not great, and very marginal for software stepping, but perhaps more than fine for a Mesa card? I have no idea why the values are so high or how they might be improved. The only thing I can think of is isolcpus, and I'm not sure whether it would help. I've done all I can with the BIOS; unfortunately there are very few things that can be disabled.
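
isolcpus is a kernel boot parameter that keeps the scheduler from placing ordinary tasks on the listed CPUs, and it is often suggested for improving latency on LinuxCNC machines. The Python sketch below only checks whether it is in effect: it reads the boot command line (/proc/cmdline) and the CPUs the kernel reports in /sys/devices/system/cpu/isolated. The function names are illustrative, and this assumes a Linux kernel that provides that sysfs file.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a kernel CPU list such as '1-3,5' into {1, 2, 3, 5}.

    Non-numeric tokens (newer kernels accept flags like 'nohz' or
    'domain' inside isolcpus=) are ignored.
    """
    cpus = set()
    for token in spec.strip().split(","):
        lo, sep, hi = token.partition("-")
        if lo.isdigit() and (not sep or hi.isdigit()):
            cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return cpus


def isolcpus_from_cmdline(cmdline: str) -> Set[int]:
    """Return the CPUs named by an isolcpus= boot parameter, if any."""
    for arg in cmdline.split():
        if arg.startswith("isolcpus="):
            return parse_cpu_list(arg[len("isolcpus="):])
    return set()


def isolated_cpus() -> Set[int]:
    """CPUs the running kernel actually reports as isolated."""
    with open("/sys/devices/system/cpu/isolated") as f:
        return parse_cpu_list(f.read())
```

On a dual-core machine, for instance, booting with isolcpus=1 and then getting {1} back from isolated_cpus() confirms that the setting took effect; rerunning the latency test afterwards shows whether it actually helped.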

One strange thing I noticed in configuring the Mesa network connection is that things work if one simply omits a "gateway" value, but that if one actually enters the suggested value (10.10.10.1) then LinuxCNC fails on startup and produces an error file.  Can anybody explain what is going on here?
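
One thing worth checking is what the gateway entry does to the routing table. With NetworkManager, a connection that has a gateway normally gets a default route through that interface, so entering 10.10.10.1 on the Mesa connection would typically leave the PC with two default routes, one of them pointing at a port that only reaches the 7i76e. Whether that is what breaks LinuxCNC startup here is not established. The Python sketch below (default_routes is a hypothetical helper) just lists the IPv4 default routes from /proc/net/route, the kernel routing table that `ip route` also displays, so the two configurations can be compared.

```python
import socket
import struct
from typing import List, Tuple


def default_routes(route_table: str) -> List[Tuple[str, str, int]]:
    """Return (interface, gateway IP, metric) for each IPv4 default route.

    `route_table` is the text of /proc/net/route, where addresses are
    printed as 32-bit hex numbers in the host's native byte order.
    """
    routes = []
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, gateway, _flags, _refcnt, _use, metric, mask = fields[:8]
        if dest == "00000000" and mask == "00000000":
            # Re-pack in native order to recover the original address bytes.
            gw_ip = socket.inet_ntoa(struct.pack("=I", int(gateway, 16)))
            routes.append((iface, gw_ip, int(metric)))
    return sorted(routes, key=lambda r: r[2])  # lowest metric is preferred
```

Usage: default_routes(open("/proc/net/route").read()). With the gateway left blank, only the Internet-facing port should appear; a second entry for the Mesa port would show that the gateway field is adding a competing default route.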

And can anybody explain how I'm getting "error finishing read" errors when communications appear to be functional, round-trip times look reasonable, and latency values seem to be within reason (for a Mesa card)?

I really am at my wits' end, and I'm beginning to think that another PC may be my only option.

27 Jun 2022 23:45 #246033 by tommylight
In the ini file, what is the servo period set to?
It must be 1000000 or 1 million.
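
For anyone wanting to check this programmatically: SERVO_PERIOD lives in the [EMCMOT] section of the INI file and is given in nanoseconds, so 1000000 means 1 ms. A minimal Python sketch using the standard configparser module follows; servo_period_ns is a hypothetical helper, and the parser options are assumptions chosen to tolerate typical LinuxCNC INI quirks.

```python
import configparser


def servo_period_ns(ini_text: str) -> int:
    """Return [EMCMOT] SERVO_PERIOD (in nanoseconds) from LinuxCNC INI text."""
    parser = configparser.ConfigParser(
        strict=False,         # tolerate repeated keys such as several HALFILE lines
        interpolation=None,   # treat '%' literally instead of as interpolation syntax
        allow_no_value=True,  # tolerate bare lines without '='
    )
    parser.read_string(ini_text)
    return parser.getint("EMCMOT", "SERVO_PERIOD")
```

Usage: servo_period_ns(open(path_to_ini).read()) == 1_000_000, where path_to_ini points at the machine's own INI file.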

27 Jun 2022 23:59 #246035 by jcbryant
The servo period is 1,000,000.

When ping is allowed to run for a while, the maximum round-trip time goes up to about 0.6 ms, which is getting close to the 1 ms servo period. What might make some exchanges take so long when the average stays not far above 0.1 ms?
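
To put these numbers side by side, the Python sketch below pulls the min/avg/max figures out of the summary line that Linux's iputils ping prints and expresses a round trip as a fraction of the servo period. The function names are illustrative. By this arithmetic the 0.6 ms worst case uses 60% of the 1 ms period on its own, while the 0.111 ms average uses about 11%. ICMP pings are not the same traffic as the driver's packets, so this is only a rough indicator.

```python
import re
from typing import Dict

# iputils ping ends with a line of the form
#   rtt min/avg/max/mdev = <min>/<avg>/<max>/<mdev> ms
# (BSD-style ping says "round-trip" and "stddev" instead).
_RTT_LINE = re.compile(
    r"min/avg/max/(?:mdev|stddev) = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def rtt_summary_ms(ping_output: str) -> Dict[str, float]:
    """Extract min/avg/max/mdev (milliseconds) from ping's summary line."""
    m = _RTT_LINE.search(ping_output)
    if m is None:
        raise ValueError("no rtt summary line found")
    return dict(zip(("min", "avg", "max", "mdev"), map(float, m.groups())))


def period_fraction(rtt_ms: float, servo_period_ns: int = 1_000_000) -> float:
    """Fraction of one servo period taken up by a single round trip."""
    return rtt_ms * 1_000_000 / servo_period_ns
```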

28 Jun 2022 00:35 #246038 by tommylight
Intel network cards?
Try this:
sudo ethtool -C eth0 rx-usecs 0
Then check the ping times.
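
rx-usecs is an interrupt-coalescing setting: roughly, how many microseconds the network card may wait after a packet arrives before raising an interrupt. Setting it to 0 asks for an interrupt right away, trading some CPU load for lower latency. eth0 in the command is only an example name; here the Mesa port would be enpls0 (after the swap). The Python sketch below reads the current value back with `ethtool -c` so the before and after can be compared. It assumes ethtool is installed, and not every driver supports every coalescing field; rx_usecs and parse_coalesce are hypothetical helpers.

```python
import subprocess
from typing import Dict


def parse_coalesce(text: str) -> Dict[str, str]:
    """Turn `ethtool -c` output lines such as 'rx-usecs: 3' into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips the "Coalesce parameters for ...:" header
            settings[key.strip()] = value.strip()
    return settings


def rx_usecs(interface: str) -> str:
    """Current rx-usecs value for `interface`, as reported by ethtool."""
    out = subprocess.run(
        ["ethtool", "-c", interface], capture_output=True, text=True, check=True
    ).stdout
    return parse_coalesce(out)["rx-usecs"]
```

For example, rx_usecs("enpls0") should return "0" after the sudo ethtool -C command has been applied to that interface. Settings made with ethtool do not survive a reboot.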

28 Jun 2022 00:47 #246040 by jcbryant
I don't have ethtool, and when I tried to install it I ran into an issue:

john@Taig:~$ sudo apt install ethtool
Reading package lists... Done
Building dependency tree
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
linux-image-5.10.0-13-rt-amd64 : Breaks: wireless-regdb (< 2019.06.03-1~) but 2016.06.10-1 is to be installed

I've only the vaguest idea what this means, and I've no idea how to proceed....
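
The apt message says that the 5.10 RT kernel package declares Breaks: wireless-regdb (< 2019.06.03-1~), meaning it refuses to coexist with any wireless-regdb version older than that, and the version apt wants to install, 2016.06.10-1, is older. Debian compares versions with its own rules: epoch first, then the upstream version, then the Debian revision; digit runs compare numerically; and '~' sorts before everything, even the end of the string. The authoritative check is `dpkg --compare-versions 2016.06.10-1 lt 2019.06.03-1~ && echo older`. The Python sketch below re-implements those rules for illustration only; compare_versions is a hypothetical name, not a Debian API.

```python
def _is_digit(c: str) -> bool:
    return "0" <= c <= "9"


def _order(c: str) -> int:
    """Weight of one character in a non-digit run ('' means end of string)."""
    if c == "" or _is_digit(c):
        return 0
    if c == "~":
        return -1  # tilde sorts before everything, including end of string
    if c.isascii() and c.isalpha():
        return ord(c)
    return ord(c) + 256  # other symbols sort after letters


def _compare_part(a: str, b: str) -> int:
    """Compare two upstream versions or two revisions, dpkg-style."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Non-digit runs: compare character by character.
        while (i < len(a) and not _is_digit(a[i])) or (j < len(b) and not _is_digit(b[j])):
            ac = _order(a[i] if i < len(a) else "")
            bc = _order(b[j] if j < len(b) else "")
            if ac != bc:
                return ac - bc
            i += 1
            j += 1
        # Digit runs: compare numerically (ignore leading zeros; longer run wins).
        while i < len(a) and a[i] == "0":
            i += 1
        while j < len(b) and b[j] == "0":
            j += 1
        first_diff = 0
        while i < len(a) and j < len(b) and _is_digit(a[i]) and _is_digit(b[j]):
            if not first_diff:
                first_diff = ord(a[i]) - ord(b[j])
            i += 1
            j += 1
        if i < len(a) and _is_digit(a[i]):
            return 1
        if j < len(b) and _is_digit(b[j]):
            return -1
        if first_diff:
            return first_diff
    return 0


def compare_versions(a: str, b: str) -> int:
    """Negative if a < b, zero if equal, positive if a > b."""
    def split(v: str):
        if ":" in v:
            epoch, _, rest = v.partition(":")
        else:
            epoch, rest = "0", v
        upstream, sep, revision = rest.rpartition("-")
        if not sep:
            upstream, revision = rest, ""
        return int(epoch), upstream, revision

    ea, ua, ra = split(a)
    eb, ub, rb = split(b)
    if ea != eb:
        return ea - eb
    return _compare_part(ua, ub) or _compare_part(ra, rb)
```

compare_versions("2016.06.10-1", "2019.06.03-1~") is negative, which is exactly the condition the Breaks line forbids. The trailing '~' in the constraint means that a backported build of 2019.06.03-1 (a version string such as the hypothetical 2019.06.03-1~bpo10+1) would still satisfy it.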

28 Jun 2022 11:43 #246071 by jcbryant
When I try "apt --fix-broken install", the suggestion is to remove Linux 5.10.

Presumably this would get me back to where I was before, and that might not be such a bad idea given that 5.10 doesn't seem to have solved anything. It would be interesting to see whether I get the same read errors with the original kernel and the LAN ports swapped. That would indicate an issue with enpls0.

But unfortunately I'd be no closer to explaining or eliminating the freeze ups.