Continued random shutdown of 5i25 / 7i77 boards

10 Jan 2017 18:22 - 10 Jan 2017 18:24 #85577 by an92626
I am still getting random shutdowns of my 5i25/7i77 boards. Initially this was thought to be a latency issue with the base thread, which I had enabled to drive the fourth axis through the parallel port. I have since moved the fourth axis off the parallel port and onto the P2 port of the 5i25, so the stepper motor is now driven by the FPGA on the 5i25 and I could delete the base thread. The problem of the 5i25 and 7i77 boards randomly shutting down continues.

At random times, both when linuxcnc is running and moving the tool, and when linuxcnc is idle and no motion is occurring, I get the series of error messages as shown below.



I am not sure whether the "hm2/hm2_5i25.0: sserial_write: Timeout waiting for CMD to clear" error occurs first, or the "hm2/hm2_5i25.0: Smart serial card hm2_5i25.0.7i77.0.0 error = (6) Remote Fault" error. Once the faults occur, linuxcnc loses control of the 5i25 / 7i77 boards and they are non-responsive. If I try to restart linuxcnc I get the errors

hm2/hm2_5i25.0: invalid cookie, got 0xFFFFFFFF, expected 0x55AACAFE
hm2/hm2_5i25.0: FPGA failed to initialize, or unexpected firmware?
hm2_5i25.0: board fails HM2 registration
RTAPI_PCI: Unmapped 65536 bytes at 0x7f9291d30000
Driver probe function failed!
hm2_pci: error registering PCI driver

Since the error message does refer to the pci driver, I am wondering if the issue is in the PC and not the 5i25/7i77 boards. I am running 64 bit Linux LMDE2 with Linux vertmill 3.18.13-rt10mah+ #3 SMP PREEMPT RT Sat Jun 20 00:51:55 CEST 2015 x86_64 GNU/Linux.

The only way to clear the fault is a complete system restart, which works about 95% of the time. Once or twice the errors continued, but a second restart has always worked.

Does anyone have any idea what would be causing this or where I could look to figure out the cause?

Thanks.
Last edit: 10 Jan 2017 18:24 by an92626.

10 Jan 2017 18:40 - 10 Jan 2017 18:47 #85579 by lincamx
The only time I have had any smart serial card error is when I was using the wrong kernel.
Shouldn't you be using the RTAI kernel and not the PREEMPT RT kernel with the 5i25 card?

For now, the PREEMPT RT kernel is for uspace and the Ethernet cards.
Last edit: 10 Jan 2017 18:47 by lincamx.

10 Jan 2017 18:52 #85580 by PCW

hm2/hm2_5i25.0: invalid cookie, got 0xFFFFFFFF, expected 0x55AACAFE


Most likely means a hardware error of some kind: a bad/dirty PCI slot,
a power supply issue (low 3.3V), a bad 5I25, etc.
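
To illustrate why the value 0xFFFFFFFF points at hardware rather than at firmware: a 32-bit PCI read that no device answers normally comes back as all ones. The following is only a minimal sketch of that reasoning, not the driver's actual check. The expected cookie value is taken from the error message quoted above; the function name `classify_cookie` and the returned strings are made up for illustration.

```python
HM2_COOKIE = 0x55AACAFE    # expected value, copied from the error message in this thread
PCI_ALL_ONES = 0xFFFFFFFF  # typical result of a 32-bit read when no device responds


def classify_cookie(value: int) -> str:
    """Give a rough diagnosis for a 32-bit value read from the cookie register.

    Hypothetical helper for illustration only; it is not part of hostmot2.
    """
    value &= 0xFFFFFFFF
    if value == HM2_COOKIE:
        return "ok"
    if value == PCI_ALL_ONES:
        # The card did not answer at all: slot, power, or board problem.
        return "no response"
    # Something answered, but not with the expected cookie.
    return "unexpected value"


print(classify_cookie(0xFFFFFFFF))
```

A read of all ones means nothing on the bus answered, which fits a card that has dropped off the bus better than a card running the wrong firmware.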

10 Jan 2017 19:48 #85582 by an92626
I get the "invalid cookie" error after it errors out in linuxcnc and I close linuxcnc and then try to re-launch the application. I can see that if something caused the initial error in linuxcnc and apparently shut down the 5i25/7i77 boards, then simply trying to restart linuxcnc would produce a hardware error similar to a bad PCI slot, a power supply issue, or a bad board. I am not a believer in the bad board explanation, since the system can run for multiple days with no error at all, doing more complex moves and control of the mill than the simple motion, or no motion at all, that is taking place when an error occurs and shuts the boards down.

I have no idea whether the PREEMPT RT kernel is unsuitable for the 5i25 board. The system can run for several days without any problem, and then it returns these errors within linuxcnc and shuts down. If the kernel were unsuitable, I would expect the shutdowns to happen more often, like always.

I think the solution lies in evaluating the errors listed by linuxcnc when the board initially shuts down. I am not sure how linuxcnc lists its errors, whether the first error is at the top or the bottom of the list. So I guess the first question is whether the "hm2/hm2_5i25.0: sserial_write: Timeout waiting for CMD to clear" error occurred first, and if so, what causes such an error; or whether the "hm2/hm2_5i25.0: Smart serial card hm2_5i25.0.7i77.0.0 error = (6) Remote Fault" error occurred first, and what causes that one. Does anyone know how linuxcnc lists its errors, and whether the first error appears at the top or the bottom of the on-screen list, so I know where to start back-tracking the problem?

Secondly, is there a listing anywhere of what error 6 Remote Fault means, or what a "Timeout waiting for CMD to clear" error means? Once I figure out which error occurred first, I probably need to try to figure out the root cause of that first error.

One interesting note on this particular failure: when it occurred this time, I was simply using the MPG pendant to manually move the mill in the X/Y directions to mill down the surface of a piece of aluminum that would be used as a tooling spacer. Since I was only moving in X/Y with the material held in a vise, I had unplugged the fourth axis unit. That raises the question of how I got a following error on joint 3, which is only a stepper motor with no feedback whatsoever, and which I was not even moving.
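
On the question of which error occurred first: if the messages can be captured somewhere with timestamps, sorting by timestamp settles the order regardless of how the on-screen list is arranged. The following is a rough sketch under a loud assumption: that the messages are available as text lines in kernel-log style, `[seconds.micros] message`. Whether this setup writes them in that form is not confirmed in this thread, and the timestamps in the sample are invented; only the message texts come from the post above.

```python
import re

# Assumed line format: "[  1234.567890] message text"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def earliest_hm2_message(lines):
    """Return (timestamp, message) of the earliest hm2 line, or None if none match."""
    stamped = []
    for line in lines:
        match = STAMP.match(line)
        if match and "hm2" in match.group(2):
            stamped.append((float(match.group(1)), match.group(2)))
    return min(stamped) if stamped else None


# Hypothetical capture; the timestamps are made up for illustration.
sample = [
    "[ 5021.300412] hm2/hm2_5i25.0: Smart serial card hm2_5i25.0.7i77.0.0 error = (6) Remote Fault",
    "[ 5021.299987] hm2/hm2_5i25.0: sserial_write: Timeout waiting for CMD to clear",
]
print(earliest_hm2_message(sample))
```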

10 Jan 2017 20:07 - 10 Jan 2017 20:58 #85586 by PCW
The errors listed are most likely meaningless; they are side effects of losing communication
with the 5I25 board. It's more likely that a systemic problem caused this loss of communication.

I would try a different PC power supply since low 3.3V would cause these exact symptoms:

Low 3.3V resets the FPGA, so it's reloaded with its PCI BAR cleared and is no longer accessible
until you reset the host.
Last edit: 10 Jan 2017 20:58 by PCW.

11 Jan 2017 23:03 #85634 by an92626
I tried looking into the power supply and bad board possibilities, with no success. I swapped the 5i25 for a 6i25, which moved the board from the PCI slot to a PCIe slot, and there was no change in the occurrence of these errors or the board shutting down. I even reloaded the entire operating system and software with no improvement.

With no other progress, I removed the video card that was in the PCIe x16 slot, switched the video back to the on-board graphics, and made a couple of changes in the BIOS. So far I have not had any recurrence of the 5i25/7i77 boards shutting down. I am not sure it makes sense that removing the video card would stop power-related problems on the 5i25/7i77 boards, but unless the problem occurs again, it seems to have fixed it.

11 Jan 2017 23:35 #85637 by PCW
I have seen similar problems, and they are all related to 3.3V dropping out (going below 3V).
This causes the 5I25/6I25 to reset, so all the PCI enumeration info is lost and the card is therefore not
accessible until a system reset. It's possible that the video card draws enough power to affect the PS regulation
enough to cause the 3.3V dropout.
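
For anyone wanting to check for this kind of dropout, a minimal sketch of the idea follows. It assumes the 3.3V rail has been logged as (time, volts) pairs by some external means such as a logging meter or scope export; how the samples are obtained is not covered here. The sample values are invented, and the 3.0V threshold is simply the "below 3V" figure mentioned above. A slow logger may miss a brief dip entirely.

```python
def find_dropouts(samples, threshold=3.0):
    """Return (start, end) time intervals during which the voltage was below threshold.

    samples: iterable of (time_s, volts) pairs in time order.
    """
    dropouts = []
    start = None
    last_t = None
    for t, v in samples:
        if v < threshold and start is None:
            start = t  # rail just fell below the threshold
        elif v >= threshold and start is not None:
            dropouts.append((start, t))  # rail recovered at time t
            start = None
        last_t = t
    if start is not None:
        # Log ended while the rail was still low.
        dropouts.append((start, last_t))
    return dropouts


# Hypothetical log of the 3.3V rail; values are made up for illustration.
rail_log = [(0.0, 3.30), (0.1, 3.28), (0.2, 2.95), (0.3, 2.90), (0.4, 3.25)]
print(find_dropouts(rail_log))
```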
