Level X Error Question
We are using Azure ThreadX and FileX along with LevelX on a Winbond 1Gbit SLC NAND. The FileX version is 6.1 and the LevelX version is 6.1.9.
We have noticed that during start-up, in the function _lx_nand_flash_open(), LX_SYSTEM_INVALID_FORMAT errors are being detected and logged. See the attached screenshots showing the state of the LX_NAND_FLASH data structure on error and soon after init.
It appears this error is not considered fatal by LevelX, as the process continues and the function eventually returns LX_SUCCESS.
Is this error something to be concerned about?
What may be causing the error?
Hello @JohnR I noticed you have opened this issue in the github repo a while ago: https://github.com/azure-rtos/levelx/issues/12
I am sorry we didn't follow-up with you there. I will contact the Product Team to take a look at this and we will reply to you asap.
Appreciate your time!
Hello @JohnR Sorry for the late reply. This LX_SYSTEM_INVALID_FORMAT error indicates that the sector value in the first page is different from the one in the spare bytes. That means some of the data differs from what was originally written. One possible cause is that the hardware ECC changed a bit. If the hardware ECC is enabled, could you please try disabling it and see if this error still occurs? If it does, the error may be caused by issues in the driver or hardware. For now, LevelX has issues with hardware ECC support. We will fix them in a future release.
Thank you for the response. I have checked and yes we do have the hardware ECC enabled on the NAND. Besides disabling the ECC on the NAND, are there any configuration changes required in LevelX? I see that LevelX appears to have support for software ECC. Do I need to use those API(s) in place of the hardware ECC? Also, is it required to reformat the NAND on the first power-up with ECC disabled?
I also wanted to mention that these errors do not occur on all of our devices.
I disabled the hardware ECC and I am still seeing the problem. I also reformatted the flash and then immediately reset the device. The errors appear on the first boot as well as on successive boots.
Note that the flash that we are using has a 64-byte spare area. For the LevelX extra bytes, we are using spare 1 which is unused by the hardware ECC.
Have you tried erasing all the good blocks after disabling the ECC? If the problem still exists, there may be another issue causing the error.
OK, I erased the flash and the errors have now gone away. I assumed fx_media_format() erased all blocks. I guess it doesn't?
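For readers following along, the erase pass discussed here can be sketched roughly as below. This is a minimal simulation, not LevelX or driver code: the flash array, the bad-block markers, and the functions `sim_block_status_get`, `sim_block_erase`, and `erase_all_good_blocks` are all hypothetical names introduced for illustration, and the geometry is a toy one rather than the Winbond part's. On real hardware the two `sim_` functions would be replaced by the NAND driver's own bad-block check and block erase.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_COUNT      8u     /* toy geometry, not the real part */
#define BYTES_PER_BLOCK  64u
#define GOOD_BLOCK       0xFFu  /* hypothetical marker values */
#define BAD_BLOCK        0x00u

/* Simulated flash contents and per-block bad-block markers. */
static uint8_t sim_flash[BLOCK_COUNT][BYTES_PER_BLOCK];
static uint8_t sim_bad_marker[BLOCK_COUNT];

/* Hypothetical driver hook: report whether a block is marked bad. */
static int sim_block_status_get(uint32_t block, uint8_t *status)
{
    assert(status != NULL);
    if (block >= BLOCK_COUNT) return -1;
    *status = sim_bad_marker[block];
    return 0;
}

/* Hypothetical driver hook: erase one block (all cells return to 1). */
static int sim_block_erase(uint32_t block)
{
    if (block >= BLOCK_COUNT) return -1;
    memset(sim_flash[block], 0xFF, BYTES_PER_BLOCK);
    return 0;
}

/* Erase every block not marked bad.
   Returns the number of blocks erased, or -1 on a driver error. */
int erase_all_good_blocks(void)
{
    int erased = 0;
    for (uint32_t block = 0; block < BLOCK_COUNT; block++) {
        uint8_t status;
        if (sim_block_status_get(block, &status) != 0) return -1;
        if (status != GOOD_BLOCK) continue;   /* leave bad blocks alone */
        if (sim_block_erase(block) != 0) return -1;
        erased++;
    }
    return erased;
}

/* Fill the simulated flash with stale data, mark block 3 bad,
   then run the erase pass. */
int demo_erase_pass(void)
{
    memset(sim_flash, 0xA5, sizeof sim_flash);
    memset(sim_bad_marker, GOOD_BLOCK, sizeof sim_bad_marker);
    sim_bad_marker[3] = BAD_BLOCK;
    return erase_all_good_blocks();
}
```

The key design point is that blocks marked bad are skipped rather than erased, so their markers are not destroyed.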
Can you explain why there is an issue with the Hardware ECC, in particular with our case? We store the LevelX extra bytes at Spare 1 offset 0. This location is not written to by anything else and it is not included in the ECC protected area for the spare data so it should not be affected.
I could see a potential problem if it were stored in User Data 1 and the hardware ECC was enabled. This is because when data is written to the sector, the chip would automatically write the calculated ECC to the spare area and then write the calculated ECC for the spare area. The problem is that the extra bytes from LevelX are not written at the same time. If LevelX comes back and then writes the extra bytes to that spare area, the previously written ECC for the spare area will no longer be correct. When the extra bytes are read out, the ECC algorithm will detect an error and attempt to correct the data, which will most likely result in corruption of the extra bytes.
Writing the extra bytes or the software ECC values separately also increases the number of page programs, which could cause the NoP spec of the chip to be violated. There are some comments about this issue on your GitHub repo.
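To make the NoP concern concrete, here is a small bookkeeping sketch that counts program operations per page between erases. Everything in it is an assumption for illustration: the limit of 4, the page count, and the function names `nop_block_erased`, `nop_record_program`, and `count_violations` are hypothetical, and the real limit has to come from the chip's datasheet.

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_PER_BLOCK 4u
#define NOP_LIMIT       4u  /* assumed for illustration; use the datasheet value */

/* Number of program operations issued to each page since the last erase. */
static uint8_t program_count[PAGES_PER_BLOCK];

/* An erase resets the per-page program counters. */
void nop_block_erased(void)
{
    for (uint32_t page = 0; page < PAGES_PER_BLOCK; page++)
        program_count[page] = 0;
}

/* Record one program operation on a page.
   Returns 0 if it is within the limit, -1 if it would exceed it. */
int nop_record_program(uint32_t page)
{
    assert(page < PAGES_PER_BLOCK);
    if (program_count[page] >= NOP_LIMIT) return -1;
    program_count[page]++;
    return 0;
}

/* Erase the block, issue `programs` program operations to one page,
   and count how many of them exceed the limit. */
int count_violations(uint32_t page, unsigned programs)
{
    int violations = 0;
    nop_block_erased();
    for (unsigned i = 0; i < programs; i++) {
        if (nop_record_program(page) != 0) violations++;
    }
    return violations;
}
```

With the assumed limit of 4, a page that receives its main data, a separate spare-area write, and three later in-place updates sees five programs, so `count_violations(0, 5)` reports one violation.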
Based on another issue I ran into, I now see why LevelX has issues with HW ECC. This is just one example, there may be more.
A bug/design flaw in LevelX causes an issue when _lx_nand_flash_block_reclaim() runs and attempts to "Write the erased started indication."
Basically, LevelX attempts a read-modify-write of the NAND without an erase, changing the value of a memory cell to LX_BLOCK_ERASE_STARTED. Since LX_BLOCK_ERASE_STARTED == 0, this would normally be OK, because programming can always clear bits to 0. The problem is that the HW ECC for the sector has already been written based on the existing values. When verification is done after the write of LX_BLOCK_ERASE_STARTED, it fails because the values read back are not correct: the HW ECC attempts to correct what it assumes are errors.
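The mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions: the single XOR check byte stands in for real hardware ECC (which is far stronger and would also try to correct the data), the 16-byte sector is arbitrary, and `toy_sector`, `toy_program`, `toy_ecc_ok`, and `demo_stale_ecc` are hypothetical names that do not come from LevelX or any NAND driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_BYTES 16u

typedef struct {
    uint8_t data[SECTOR_BYTES];
    uint8_t ecc;          /* toy check byte standing in for HW ECC */
    int     ecc_written;  /* set once the check byte has been stored */
} toy_sector;

/* Toy "ECC": XOR of all data bytes. */
static uint8_t toy_ecc(const uint8_t *buf)
{
    uint8_t x = 0;
    for (uint32_t i = 0; i < SECTOR_BYTES; i++) x ^= buf[i];
    return x;
}

/* Erase: every cell returns to 1. */
void toy_erase(toy_sector *s)
{
    assert(s != NULL);
    memset(s->data, 0xFF, SECTOR_BYTES);
    s->ecc = 0xFF;
    s->ecc_written = 0;
}

/* Program: cells can only go from 1 to 0, so new data is ANDed in.
   The check byte is stored on the first program only, modeling the
   case where the ECC was already written for the old contents. */
void toy_program(toy_sector *s, const uint8_t *buf)
{
    for (uint32_t i = 0; i < SECTOR_BYTES; i++) s->data[i] &= buf[i];
    if (!s->ecc_written) {
        s->ecc = toy_ecc(s->data);
        s->ecc_written = 1;
    }
}

/* Returns 1 if the stored check byte still matches the data. */
int toy_ecc_ok(const toy_sector *s)
{
    return toy_ecc(s->data) == s->ecc;
}

/* First program stores data plus check byte; a later in-place
   overwrite of one byte to 0 leaves the check byte stale. */
int demo_stale_ecc(void)
{
    toy_sector s;
    uint8_t buf[SECTOR_BYTES];

    toy_erase(&s);
    memset(buf, 0xFF, sizeof buf);
    buf[0] = 0x5A;                    /* initial metadata value */
    toy_program(&s, buf);
    if (!toy_ecc_ok(&s)) return -1;   /* consistent after first program */

    memcpy(buf, s.data, sizeof buf);  /* read-modify-write, no erase */
    buf[0] = 0x00;                    /* status value of 0 */
    toy_program(&s, buf);
    return toy_ecc_ok(&s);            /* check byte no longer matches */
}
```

In this model the overwrite itself succeeds, since clearing bits is always allowed, but the stored check byte no longer describes the data. A real ECC engine would go one step further and try to "correct" the mismatch.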
Thank you so much for sharing your findings with the community @JohnR .
It is my understanding that, for now, the workaround provided by @Xiuwen Cai is OK to proceed with?
**Workaround:** Disable hardware ECC and erase all the good blocks after disabling the ECC.
The Product Team acknowledged this as a bug and will fix it in a future release: "LevelX has issues with the hardware ECC support. We will fix them in the future release."
Let me know if the workaround is blocking you from going to production, and what your target date is. In order to prioritize the research and fix of this issue, we would need you to create an Azure Support Request.
@António Sérgio Azevedo Yes, I believe that would prevent the issue I mentioned from occurring. I do think this means it might be very difficult to enable ECC later on products that shipped with it off.
Unfortunately, because of the issue with NoP (Number of Partial programs), I do not believe software ECC can be used reliably either. The NoP issue occurs even without software ECC. Both of these issues could affect the stability and reliability of the filesystem, so I would think they would be a high priority for the team.
Our product launch is in early Q3 of this year. I will file a support request as you mention.
Any updates on this and similar issues? When can an implementation that doesn't do multiple writes per page be expected?
Is there any public implementation of FileX+LevelX+NAND that is usable in industrial environments?
Microsoft communicated to me that the target date for a fix was July. I am not sure whether it has been fixed yet.
We are working on a LevelX NAND fix to resolve the ECC issue and are currently testing the new updates. The fix should be available soon.