The software stack is bootstrap + U-Boot + Linux (2.6.30). Most of the boards work well, but several do not boot successfully. The issue is that one or two bits of local variables on the stack (in DDR) are incorrect. It usually shows up during NAND ECC calculation when software ECC is enabled. About 4% of NAND page ECC calculations hit an ECC error, meaning the calculated ECC does not equal the ECC read from NAND. When the ECC error occurs, we invoke the nand_calculate_ecc function again, and the second result usually does equal the ECC read from NAND. We dug into the local variables on the stack (DDR) for the two ECC calculation passes and noticed that they differ. The following log shows that some variables differ within bits 23 to 20 between the two passes (bit 22 in both pairs): B7D36895 vs B7936895, and 24830BA6 vs 24C30BA6.
[ 5.512000] rp0 B7D36895 2E2FE10D A090C0AC 24830BA6 - F586C593 FF278FA7
[ 5.537000] rp0 B7936895 2E2FE10D A090C0AC 24C30BA6 - F586C593 FF278FA7
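To pin down exactly which bits differ between the two log lines, the words can be XORed pairwise. The following is only a small sketch for inspecting the logged values on a host machine; the helper names (diff_mask, count_bits, report_diff) are made up for illustration and are not part of the kernel or of nand_calculate_ecc.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* XOR leaves a 1 in every bit position where the two words disagree. */
static uint32_t diff_mask(uint32_t first, uint32_t second)
{
    return first ^ second;
}

/* Count the set bits in a mask, i.e. the number of flipped bits. */
static int count_bits(uint32_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Print the XOR mask and each differing bit position for one pair
 * of words taken from the two log lines (hypothetical helper). */
static void report_diff(uint32_t first, uint32_t second)
{
    uint32_t mask = diff_mask(first, second);
    int bit;

    printf("%08" PRIX32 " vs %08" PRIX32 ": mask %08" PRIX32 ", %d bit(s):",
           first, second, mask, count_bits(mask));
    for (bit = 31; bit >= 0; bit--) {
        if (mask & (UINT32_C(1) << bit))
            printf(" %d", bit);
    }
    printf("\n");
}
```

Calling report_diff(0xB7D36895, 0xB7936895) and report_diff(0x24830BA6, 0x24C30BA6) gives the mask 00400000 for both pairs, a single flipped bit at position 22. The other four words on each line are identical.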
We ran some experiments, such as swapping the SoC, NAND flash and DDR between boards. The issue appears to follow the SoC exactly: it shows up on whichever board a specific SoC chip is moved to.
My questions are:
1. Are those SoC chips defective, fake, or something else?
2. Are there known variants of the SAM9M10-G45, and ways to detect them, so that we or the Linux community can fix or patch around them in software?
3. If those SoC chips are defective or fake, is there an easy method to detect them? So far we detect the issue by swapping chips between boards, which takes too much effort.
