Print freezes due to checksum mismatch

What is the problem?

In the serial communication happens a checksum mismatch, and the print freezes. These mismatches happen at seemingly random times, so the same gcode file has different progress on different attempts until the mismatch happen. After the serial mismatch, octoprint complains that the printer isn't giving feedback, but receives temperature state messages from the printer

What did you already try to solve it?

  • Changed USB cable
  • Isolated the +5V pin in the USB cable
  • Changed power supply, current supply is powering the Raspberry PI directly via a buck converter from the printer's PSU

Have you tried running in safe mode?

Yes

Did running in safe mode solve the problem?

Unfortunately not

Complete Logs

octoprint.log

serial.log
I'll try to attach, but these are the last few lines with the (in my eyes) important part

2021-03-03 12:41:07,999 - Recv: ok
2021-03-03 12:41:08,002 - Send: N6434 G0 F9000 X125.92 Y117.435*74
2021-03-03 12:41:08,034 - Recv: ok
2021-03-03 12:41:08,037 - Send: N6435 G1 F1500 X102.628 Y94.142 E540.75878*5
2021-03-03 12:41:08,753 - Recv: ok
2021-03-03 12:41:08,757 - Send: N6436 G0 F9000 X103.017 Y93.966*71
2021-03-03 12:41:08,798 - Recv: ok
2021-03-03 12:41:08,801 - Send: N6437 G1 F1500 X125.92 Y116.869 E541.62061*9
2021-03-03 12:41:09,530 - Recv: ok
2021-03-03 12:41:09,533 - Send: N6438 G0 F9000 X125.92 Y116.304*66
2021-03-03 12:41:09,565 - Recv: ok
2021-03-03 12:41:09,569 - Send: N6439 G1 F1500 X103.382 Y93.765 E542.46872*1
2021-03-03 12:41:09,852 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:76 B@:41
2021-03-03 12:41:10,326 - Recv: ok
2021-03-03 12:41:10,330 - Send: N6440 G0 F9000 X103.729 Y93.547*67
2021-03-03 12:41:10,363 - Recv: ok
2021-03-03 12:41:10,365 - Send: N6441 G1 F1500 X125.92 Y115.738 E543.30376*0
2021-03-03 12:41:10,432 - Recv: ok
2021-03-03 12:41:10,434 - Send: N6442 G0 F9000 X125.92 Y115.172*79
2021-03-03 12:41:10,455 - Recv: ok
2021-03-03 12:41:10,458 - Send: N6443 G1 F1500 X104.058 Y93.31 E544.12642*51
2021-03-03 12:41:11,098 - Recv: ok
2021-03-03 12:41:11,102 - Send: N6444 G0 F9000 X104.359 Y93.046*71
2021-03-03 12:41:11,132 - Recv: ok
2021-03-03 12:41:11,135 - Send: N6445 G1 F1500 X125.92 Y114.607 E544.93775*1
2021-03-03 12:41:11,855 - Recv:  T:200.03 /200.00 B:60.00 /60.00 @:75 B@:37
2021-03-03 12:41:12,592 - Recv: ok
2021-03-03 12:41:12,596 - Send: N6446 G0 F9000 X125.92 Y114.041*75
2021-03-03 12:41:12,629 - Recv: ok
2021-03-03 12:41:12,632 - Send: N6447 G1 F1500 X104.641 Y92.762 E545.73847*4
2021-03-03 12:41:13,853 - Recv:  T:200.03 /200.00 B:60.00 /60.00 @:75 B@:37
2021-03-03 12:41:14,061 - Recv: ok
2021-03-03 12:41:14,065 - Send: N6448 G0 F9000 X104.909 Y92.464*65
2021-03-03 12:41:14,090 - Recv: ok
2021-03-03 12:41:14,093 - Send: N6449 G1 F1500 X125.92 Y113.475 E546.5291*63
2021-03-03 12:41:15,497 - Recv: ok
2021-03-03 12:41:15,501 - Send: N6450 G0 F9000 X125.924 Y112.913*112
2021-03-03 12:41:15,532 - Recv: ok
2021-03-03 12:41:15,536 - Send: N6451 G1 F1500 X105.153 Y92.143 E547.31069*3
2021-03-03 12:41:15,584 - Recv: Error:checksum mismatch, Last Line: 6449
2021-03-03 12:41:15,852 - Recv:  T:200.00 /200.00 B:60.01 /60.00 @:76 B@:33
2021-03-03 12:41:17,853 - Recv:  T:199.97 /200.00 B:60.02 /60.00 @:76 B@:29
2021-03-03 12:41:19,853 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:76 B@:37
2021-03-03 12:41:21,875 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:76 B@:37
2021-03-03 12:41:23,861 - Recv:  T:200.03 /200.00 B:59.99 /60.00 @:74 B@:41
2021-03-03 12:41:25,860 - Recv:  T:200.03 /200.00 B:60.00 /60.00 @:74 B@:37
2021-03-03 12:41:27,860 - Recv:  T:200.00 /200.00 B:60.01 /60.00 @:75 B@:33
2021-03-03 12:41:29,861 - Recv:  T:200.03 /200.00 B:60.00 /60.00 @:74 B@:37
2021-03-03 12:41:31,859 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:75 B@:42
2021-03-03 12:41:33,860 - Recv:  T:200.00 /200.00 B:59.97 /60.00 @:75 B@:47
2021-03-03 12:41:35,860 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:75 B@:38
2021-03-03 12:41:37,860 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:75 B@:39
2021-03-03 12:41:39,859 - Recv:  T:200.00 /200.00 B:59.97 /60.00 @:75 B@:48
2021-03-03 12:41:41,859 - Recv:  T:200.03 /200.00 B:60.00 /60.00 @:74 B@:40
2021-03-03 12:41:43,860 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:75 B@:41
2021-03-03 12:41:45,861 - Recv:  T:200.07 /200.00 B:59.99 /60.00 @:73 B@:45
2021-03-03 12:41:45,862 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2021-03-03 12:41:45,876 - Send: N6452 M105*18
2021-03-03 12:41:45,889 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 6449
2021-03-03 12:41:47,863 - Recv:  T:200.07 /200.00 B:60.00 /60.00 @:73 B@:41
2021-03-03 12:41:49,861 - Recv:  T:200.14 /200.00 B:60.00 /60.00 @:72 B@:42
2021-03-03 12:41:51,861 - Recv:  T:200.10 /200.00 B:59.99 /60.00 @:73 B@:46
2021-03-03 12:41:53,860 - Recv:  T:200.14 /200.00 B:60.00 /60.00 @:72 B@:42
2021-03-03 12:41:55,860 - Recv:  T:200.21 /200.00 B:59.99 /60.00 @:70 B@:47
2021-03-03 12:41:57,860 - Recv:  T:200.07 /200.00 B:60.00 /60.00 @:73 B@:42
2021-03-03 12:41:59,861 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:74 B@:47
2021-03-03 12:42:00,385 - Changing monitoring state from "Printing" to "Cancelling"
2021-03-03 12:42:01,860 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:74 B@:46
2021-03-03 12:42:03,860 - Recv:  T:200.00 /200.00 B:60.01 /60.00 @:74 B@:38
2021-03-03 12:42:05,861 - Recv:  T:200.00 /200.00 B:60.01 /60.00 @:74 B@:37

Additional information about your setup

  • Ender 3 Pro
  • Creality 4.2.2 Board
  • Marlin Firmware 2.0.7.2 - not the bugfix and not the nightly, but only the last stable build
  • Raspberry Pi 3
  • Octoprint 1.5.3
  • Octopi 0.18.0

Attached full serial.log
serial.log (731.5 KB)

Your printer complains about a checksum mismatch but never requests an actual resend. That's a firmware bug or misconfiguration.

This is what things should look like:

Send: N6451 G1 F1500 X105.153 Y92.143 E547.31069*3
Recv: Error:checksum mismatch, Last Line: 6449
Recv: Resend:6450
Recv: ok

The temperature messages are out-of-band and don't factor in the timeouts for the actual active request/response communication with the printer, which - due to the missing resend request - is stalled here.

Thanks for your hint. I tried again, this time the print freezes again, but the printer did send the reset command. However, octoprint isn't sending in a format the printer accepts:

2021-03-03 16:27:29,962 - Send: N3293 G1 F3000 X116.977 Y111.177 E525.32872*61
2021-03-03 16:27:30,058 - Recv: ok
2021-03-03 16:27:30,061 - Send: N3294 G0 F9000 X117.63 Y108.378*76
2021-03-03 16:27:30,164 - Recv: ok
2021-03-03 16:27:30,167 - Send: N3295 G0 X118.786 Y107.725*18
2021-03-03 16:27:30,176 - Recv: Resend: 3295
2021-03-03 16:27:30,263 - Recv: ok
2021-03-03 16:27:30,265 - Send: N3295 G0 X118.786 Y107.725*18
2021-03-03 16:27:30,277 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 3295
2021-03-03 16:27:31,345 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:78 B@:41
2021-03-03 16:27:33,345 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:78 B@:46
2021-03-03 16:27:33,347 - Communication timeout during an active resend, resending same line again to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2021-03-03 16:27:33,358 - Send: N3295 G0 X118.786 Y107.725*18
2021-03-03 16:27:33,371 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 3295
2021-03-03 16:27:35,345 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:78 B@:41
2021-03-03 16:27:37,344 - Recv:  T:199.97 /200.00 B:59.99 /60.00 @:78 B@:45
2021-03-03 16:27:37,346 - Communication timeout during an active resend, resending same line again to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
2021-03-03 16:27:37,351 - Send: N3295 G0 X118.786 Y107.725*18
2021-03-03 16:27:37,369 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 3295
2021-03-03 16:27:39,345 - Recv:  T:200.00 /200.00 B:60.00 /60.00 @:78 B@:41
2021-03-03 16:27:41,345 - Recv:  T:200.00 /200.00 B:59.99 /60.00 @:78 B@:45
2021-03-03 16:27:41,347 - Communication timeout during an active resend, resending same line again to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.

This still looks wrong, now the Error is missing.

The printer requests line 3295 without a reason stated, OctoPrint still sends it, the printer complains about a line number mismatch but sends no resend request...

Something's broken here on the firmware side. Again, the firmware is supposed to send

  1. an Error line stating the problem
  2. a Resend line requesting the line number to be resent (I'd expect that to be 3296 here actually)
  3. an ok to signal that the next line may be sent

What we see here however is 2&3 at the start, and after that only 1. A valid communication error report consists of all of 1, 2 and 3, not a subset of them.

We've seen this before, it is a firmware bug - may or may not be fixed for the next release of Marlin, the bugfix branch. While they did break a lot of things in the process, reworking some of the serial code, it is apparently better now and fixed a bunch of weird issues that would occur with serial comms. I'm not explicitly recommending it but it might be fixed in the development branch, I hear 2.0.8 should be coming soonish...

Reference: Serial communications distrubs and does not recover Β· Issue #3917 Β· OctoPrint/OctoPrint (github.com)

I tried it with the bugfix branch as well, just in case. No success. But thanks for the hint.

Marlin issues are a bit of a nightmare to read, but maybe a search there might be worth it to see if there is something similar:

I am currently opening an issue in Marlin's bugtracker :wink:

1 Like

I've written the serial and queue changes. There was one issue with the previous queue code where, when one serial port had an error, Marlin would exit the command receiving loop, so any later serial port would be skipped (this usually creates new errors and so on).

This is fixed. There is another issue that Victor spotted yesterday that could happen if the printer is in a multiserial configuration and receiving commands on both serial port. It could then mess up the redirection (sending the error message on the wrong serial port). This will be fixed soon.

We still have a pending issue here (or misunderstanding) where the printer sends a "wait" command (in the firmware it's understood as "Me, the printer: I need more command, I'm waiting") and Octoprint does not fill in more commands.

Also, we have some lagging issues where the printers lags behind the host's commands (in that case, Octoprint is sending line N342... and the printer issue a "ok: N323 P0 B31" for example). I don't know the logic in Octoprint, if you are trying to interpret the "Bxx" part in the answer or not.

If you don't, I think Octoprint should skip sending command until the answer is not too far behind the next command's line so the print can resync with the host. In the example above, Octoprint should wait until it receives N341 for example before sending N343.

If you do, how do you find out what is the number of command buffers ? Do you ensure that the number of pending commands is still within the number of buffers or the buffer don't overflow ? In the example above, N342 - N323 means that there are 19 un-ack commands in the printer's (internal) serial buffer, not the command buffer yet (else Bxx would be lower in a 32 buffers scenario). I don't know if you account for the printer's serial buffer size here or not. This cause issues with the serial communication since at some point in time, the internal serial buffers will overflow (they can't store more data without dropping oldest value)

For reference (and numerous serial logs), you can check this link

1 Like

OctoPrint core currently doesn't take the the B, P and N parameters on the acknowledging ok into account, simply because that is stuff that isn't universally available by far and would require a completely different implementation to utilize (though there is a plugin that does utilize this information if available for better flow control).

The way things work is full out ping-pong. Printer sends an ok? OctoPrint sends the next line. Printer sends a resend request for line N? OctoPrint resets to that line and then (re)sends from there. This works across firmware reaching as far back as 2012, which many of the vendor forks out there shipped with printers sold these days still go back to. OctoPrint's job is to pump data to the printer as fast as it can. So if the printers tells it "ok, send more" it will send more. It's not the job of the host to manage the printer's internal buffers. In fact, considering the myriad of firmware variants out there, it's outright impossible to do this universally without breaking things for almost everyone.

It will do that, but only if it is printing AND allowed to do so. There's a configuration setting serial.supportWait that defaults to true. Why only while printing? Firmware will send this also while idle when no commands need to be sent right now, so in these cases it's ignored instead of trying to send something just for the sake of sending something. Why is this a setting? There's firmware out there that misuses wait as busy instead of the meaning you described, so treating it like an ok in these cases would be fatal. In any case, if you are seeing cases where a print job is in progress, wait support is enabled and it is NOT handled correctly, please open a bug report.

That makes sense. I've no idea what the OP had set in his octoprint settings (and we usually don't ask in Marlin's bug report).
It's interesting to know that the "advanced ok" mode is ignored by Octoprint. I'm new to the serial communication de-facto "standard" here so I don't know about the history. I'm resuming from the existing implementation in Marlin, and I've spotted few bugs and fixed them.

I wonder why it was written then if no host uses it.

I wonder why it was written and I learned about it from seeing it in a serial.log provided by users, instead of getting a ping about it or even a request for comment - communication is sadly not often optimal :sweat_smile: As far as I know it used to still be disabled at that point in time too, effectively a hidden feature that next to no one would even activate, so it didn't make sense to jump through hoops to make use of it then if only a fraction of printers OctoPrint would ever interface with would even have it :woman_shrugging:

Still, even if it is currently not used by OctoPrint itself, it does help, because it adds some debug context in the cases it is active and available. Knowing which line an ok is acknowledging is pretty crucial. Would be awesome if that could be relied upon for general printer communication (and would also open up things on the API like returning responses for commands) but alas, the firmware fragmentation pretty much makes that impossible right now.

There was BufferBuddy created, that uses advanced_ok to fill the buffer:

That's the plugin I meant, though I couldn't remember the name off the top of my head. Also why I said OctoPrint core doesn't utilize it. Awesome plugin though!

1 Like

I've no idea what the OP had set in his octoprint settings (and we usually don't ask in Marlin's bug report).

If this isnt enabled by default, i did not set it. The serial connection settings i have are mostly default, since that worked fine for me. The only setting i have changed is the option " Simulate ok for resend requests", which i set to "always" after @foosel mentioned a configuration issue, and before i opened the bug report on Marlin's github, since that option did not help either, and i was out of ideas.

That would only help if the problem was missing ok after a resend request, the issue we are looking at however is the resend request itself missing/being sent gods knows where.