Nightmares in PC Building
You may remember my adventure of building a new PC. I had never put together an entire PC completely solo (I shot first), so it was quite the adventure. A while ago (actually quite a while ago now; I am terrible at writing promptly) I started having random problems that took a lot of time to investigate, so I am writing this in the hope of shedding some light for others.
What I experienced, I believe, started as a few hard crashes. These did not appear to be blue screens. Or if they were, I did not see any sign of them as such. It was a quick hard break of all operations. The screen did not just freeze, but pixels cascaded from rainbow to black. Sound went all dial-up modem dial tone, briefly. It did not persist from what I remember (I am not positive on this detail, I may have had to shut the computer off myself).
This only happened a couple of times, so I did not think much of it. But suddenly, shit got real. Programs would randomly crash. Browser tabs would uh oh and break. I started seeing blue screen after blue screen. At first, it seemed like I needed to be doing something (playing a game, watching videos), and then it just got more and more random. No discernible pattern. This time I could see the blue screens, but usually not for long enough to actually read the details; the PC would immediately restart. Windows likes to record these bluescreens in minidump files. As of Windows 7, afaict, it does not come with a native tool to view these. So I used a program called BlueScreenView to take a peek. It looks to be lightweight and portable, and gives as much info as it can (whether or not that information is understandable or useful is another thing).
As you can see, the important things here are the actual error message and the filename that appears to have caused it. Interestingly enough, I ran the program on a different PC to grab a screenshot, and that PC shows the filename for every file in the stack, whereas the affected PC only showed the ones in red, which were allegedly the files that had the error. The errors I was receiving were:
- DRIVER_IRQL_NOT_LESS_OR_EQUAL
- MEMORY_MANAGEMENT
- PFN_LIST_CORRUPT
- IRQL_NOT_LESS_OR_EQUAL
- BUGCODE_USB_DRIVER
- CRITICAL_OBJECT_TERMINATION
These were largely attributed to ntoskrnl.exe, but sometimes to other system files. This had all the symptoms of omgwtfbbq. From my research, these errors can be caused by any number of things (in no particular order):
- Specific drivers could be corrupt
- Memory could be going bad
- Something could be overheating (CPU, GPU, etc)
- Hard drive could be going bad and not reading correctly
- Power supply not providing consistent power
- Other hardware failure (CPU, GPU)
So, in other words, pretty much everything that could possibly be a problem, could possibly be a problem. Terrific.
Memory
I started with the easiest, and what seemed the most likely: memory. I picked up Memtest86+, which is pretty much the go-to program for memory testing. You need to burn an ISO to a CD, or (what I find easier) install it on a USB drive, and configure the BIOS to boot from whichever you chose. You’ll want to run the test in a number of configurations for multiple hours each. Test each stick one at a time, and then combinations of the sticks together. This will give the memory a good workout to see what errors pop up. Moving the sticks around between slots will help narrow down whether the problem is a memory slot rather than the memory itself. For me, unfortunately, the memory performed just fine, even after running for 8+ hours in numerous configurations.
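To keep track of which stick and slot combinations have been tested, it can help to write the plan out first. The following is only a rough sketch of one way to enumerate the runs; the function name, the stick labels, and the slot names are made up for illustration and have nothing to do with Memtest86+ itself.

```python
from itertools import combinations


def memtest_plan(sticks, slots):
    """Return a list of {slot: stick} layouts to test, one Memtest run each.

    Hypothetical helper: the labels are whatever you wrote on the sticks.
    """
    if len(sticks) > len(slots):
        raise ValueError("more sticks than slots")

    plan = []
    # Single-stick runs: every stick in every slot, so a bad stick can be
    # told apart from a bad slot.
    for stick in sticks:
        for slot in slots:
            plan.append({slot: stick})
    # Multi-stick runs: every group of two or more sticks, seated in the
    # first slots in order.
    for size in range(2, len(sticks) + 1):
        for group in combinations(sticks, size):
            plan.append(dict(zip(slots, group)))
    return plan


if __name__ == "__main__":
    runs = memtest_plan(["A", "B"], ["DIMM1", "DIMM2"])
    for number, layout in enumerate(runs, start=1):
        seated = ", ".join(f"{slot}={stick}" for slot, stick in layout.items())
        print(f"Run {number}: {seated}")
```

If one stick fails in every slot, the stick is the suspect; if every stick fails in the same slot, the slot (or the board) is.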
Drivers
Next, I went to the second easiest: drivers. The crashes appeared somewhat graphical to me (okay, I had no idea), and I did dork up a video driver install a while ago. I tried to install the latest AMD video drivers, but ran into the illustrious NSIS error. This basically says the installer failed its self-check and thinks it is corrupt. It is typically caused by an incomplete download, borked media, or a hardware problem. I tried downloading it multiple times, from multiple sources. It still gave me the error, which troubled me. Surely if many people were having this problem, it would have been noticed by the internet at large. So I doubted it was a problem with the file as published. The corruption must have been happening somewhere in between.
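One way to check whether the downloads themselves differ is to hash each copy and compare. This is a minimal sketch using Python's standard hashlib; the function names are my own, and it is not something I ran at the time.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large installers do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Map each distinct hash to the list of files that produced it."""
    groups = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(str(path))
    return groups


if __name__ == "__main__":
    # Usage: python compare_downloads.py copy1.exe copy2.exe copy3.exe
    for file_hash, files in group_by_hash(sys.argv[1:]).items():
        print(file_hash, files)
```

If copies of the same installer from different sources come out with different hashes, or the same file hashes differently on two reads, that points at the machine doing the downloading rather than the file.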
Heat
Now, at this point I was thinking memory again. But I already tested memory, to shreds you say. So I thought back to Fifth Element and thought, I need the heat. No wait, I need less the heat! I checked out SpeedFan at a friend’s recommendation. My ASUS board came with a decent motherboard monitoring application, but I figured the more the merrier! This one does have a few more bells and whistles, so it was worth it. I decided to try out prime95. While its main purpose is hunting for prime numbers rather than stress testing, it will get the job done because calculating prime numbers is intensive stuff. And the program has a test to make sure the system gives correct results, because when a computer breaks under the stress, a prime number generator becomes slightly useless. Slightly.
Running prime95 would bring the CPU heat to not so happy levels. Not outside the threshold, but higher than I thought it should be (I didn’t know what would actually be normal, but I built a pretty boss machine, it should be cool under the pressure). It also would sporadically error during the test. I remembered that, at the time of the build, I didn’t really know how to apply thermal paste. So I did some looking and learned that each CPU kinda has a best practice when it comes to applying it. I had not followed that best practice for my Phenom II (which is a single pea-sized dot in the middle, left to spread naturally when the CPU and heatsink are pressed together). So I decided to try that method out. It actually did seem to help heat-wise. Not significantly during idle and basic operation, but during the prime95 stress test it capped out at a lower temp. While I still occasionally got an error during tests, it was not nearly as frequent. However, the bluescreens kept on coming.
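The idea of a stress test that also checks its own answers can be sketched in a few lines. To be clear, this is not how prime95 works (as far as I know it checks against known-good results); it is a toy that repeats the same deterministic work and flags any run that disagrees with the first one. The function names and default values are made up.

```python
import hashlib


def workload(seed, rounds):
    """Deterministic busywork: hash the seed over and over."""
    data = seed
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()


def consistency_check(passes=5, seed=b"stress", rounds=200_000):
    """Count how many passes disagree with the first (reference) pass.

    Caveat: the reference is computed on the same machine, so this only
    catches intermittent errors, not ones that are wrong every time.
    """
    reference = workload(seed, rounds)
    mismatches = 0
    for _ in range(passes):
        if workload(seed, rounds) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatched passes:", consistency_check())
```

On healthy hardware every pass should match. Any mismatch means the same calculation gave two different answers, which is the kind of sporadic error I was seeing.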
The Great Big Install
Okay, so, nothing had seemed to help at this point. I was starting to worry that it could be miscellaneous hardware related, but I did not want to go down that route just yet. I thought, maybe there could be something wrong with the Windows install. It’s a stretch, but just maybe it got broke in a no-way-that-was-my-fault kind of way. If you remember, I pretty much only have the OS on the SSD. All my data, including my profile data (weee) is on the HDD. So it should just be a simple re-install and go. I have my own install CD and the key. So I did a clean, fresh install with nothing else installed. And what do I get? Immediately a bluescreen. Cannot catch a break!
Hardware? Hardthere!
I started to fret a bit. So I put down the guitar. I had pretty much eliminated all the easy, free ways to go about this. It was time to start looking at warranties, with one downside: I knew my particular processor was not in production anymore. I kept that last on the list, in the hope that I wouldn’t have to try that one, because I likely would have had to downgrade from a six-core to a four-core processor. I ended up replacing the motherboard, power supply, memory, and graphics card. I tried various configurations of the new parts, some causing more trouble than I expected. At first it didn’t like my new power supply, and the board would not POST. It would hang on something. Sometimes memory, sometimes the graphics card. I could not place what the problem was here. I re-seated everything and it seemed to come around eventually. I am actually not completely sure which parts are the new ones anymore. I may have lost track.
With new pieces of hardware in there, I was forced to do another fresh install of Windows. The Windows anti-piracy check was kicking in because I had changed configurations around a bit post-install, so I wanted it to just be happy (well really, I had no choice other than to try to pick out what the install configuration was).
That is when all hell broke loose. Windows decided that install was not on the menu. I was greeted with bluescreens: A driver corrupted pool memory used for holding pages destined for disk (pretty sweet eh?)
and the most annoying error ever: Windows cannot install required files. The file does not exist. Make sure all files required for installation are available, and restart the application. Error code: 0x80070003
I cannot even describe how frustrating this error is. There is very little useful information about it. Some tell you to replace your storage drive, or your memory, or the media you are installing from. I acquired other copies of the installer, with no different result. I even tried to create a partition on the HDD and install there, but to no avail.
I thought to myself, what about Linux. Computers love Linux. It’s a fact, Jack! I installed a version of Linux (SUSE was the copy I had on hand) and that installed and ran flawlessly. I didn’t exercise it much, but enough for me to believe that I had gremlins or something.
What is that about SSDs?
At this point, I was a little lost, and looking for any straw I could find. Even that slightly bent one in the corner that leaks. I remembered a conversation with a friend about how his SSD had a hidden partition on it that contained miscellaneous DLLs and drivers. He noticed they would never go away, and when he tried to delete them it broke the install. I could not find exact confirmation of this, but I did notice a small, 100 MB partition on my SSD that I know I did not create. I could not exactly look at what was on the partition at this stage, but the idea seemed plausible.
I started poking around about installing on SSDs in general, and saw some threads about how it was important to actually wipe the partitions on an SSD before doing a re-install. Oops, I know I did not do that. Was that my problem? The Windows installer comes with a hidden command line and a simple partition manipulation program called DiskPart that is accessible during an attempted install. It is really basic, and when you choose to clean a disk, it wipes every partition on that disk. Luckily for me, that is what I wanted.
I’ll link to sevenforums.com, which has some nice tutorials, but I am going to recap some of the steps. I used the Windows 7 full installer from the disk I had.
In the BIOS, set the boot order to boot first from the CD/DVD drive, insert the Windows 7 installation DVD and restart the PC. At the first black screen, hit the space bar for the “Press any key …” prompt, then at the “Language” screen hold the “Shift” key and hit F10 to open a command window.
In the command window that opens type diskpart and press Enter to get started.
In the command prompt, type list disk and press Enter. NOTE: This will give you a list of disk numbers to select from. You don’t get much information other than size to determine which disk is which.
In the command prompt, type select disk # and press Enter. NOTE: You would substitute # for the disk number listed that you want to use clean or clean all on. For example, I want to use one of them on Disk 1 (from step 1) for my USB key drive, so I would type select disk 1 and press Enter. Since size is about all you have to go on, it helps if you can get far enough into the install process to see what the disk number is.
Clean – In the command prompt, type clean and press Enter. NOTE: This will not take long to finish. Think of it as being like a quick format.
or
Clean all – In the command prompt, type clean all and press Enter. NOTE: This will take quite some time (several hours or more) to finish depending on how large the disk is since it is writing over each and every sector on it to zero. Think of it as being like a full or low level format.
You do not want to use clean all often on SSDs. It will write 0 to every sector, which can shorten the lifespan of the SSD as writes are their Achilles’ heel.
To close the command window when finished type exit and press Enter to leave diskpart, then exit and press Enter again to close the command window and get back to the installer.
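DiskPart can also read its commands from a text file (diskpart /s followed by the file name), which saves retyping if you end up doing this more than once. The sketch below just builds that text from the steps above. The function name is made up, and I typed the commands by hand at the time, so treat it as an untested convenience. It would have to be run on a working machine, since the installer environment has no Python.

```python
def build_diskpart_script(disk_number, wipe_all=False):
    """Return the text of a DiskPart script that cleans one disk.

    Hypothetical helper. DESTRUCTIVE when the script is actually run:
    every partition on the selected disk is removed.
    """
    if not isinstance(disk_number, int) or disk_number < 0:
        raise ValueError("disk_number must be a non-negative integer")

    lines = [
        "list disk",                         # show the disks for the log
        f"select disk {disk_number}",        # pick the disk to wipe
        "clean all" if wipe_all else "clean",  # quick clean or zero every sector
        "exit",                              # leave DiskPart
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example only: disk 1 is an assumption, check list disk output first.
    with open("clean_disk.txt", "w") as handle:
        handle.write(build_diskpart_script(1))
```

Double check the disk number against the list disk output before running anything; the script will happily wipe the wrong drive.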
Wheel of Morality
Turn, turn, turn. Tell us the lesson that we should learn. With the partitions on the SSD wiped, I was able to install Windows just fine, and got things running. What did I learn about what was wrong? Absolutely nothing. I had mucked around with so many things that I really cannot be sure what the problem was in the end. Did some piece of hardware go bad? Did I have a heat problem tiring the bits out so they had to pull over and cool down? Did something go wrong on that hidden SSD partition? Hell if I know, but I do know that I now at least have some notes of what I did try. Maybe these will help you.
March 21st, 2013 at 20:04 pm
Hmm, I should really change that icon …
March 21st, 2013 at 21:36 pm
Holy crap! 14 months later, and he finally posts!
March 21st, 2013 at 21:48 pm
Damn skippy! I was sitting on those screenshots and links for forever.
March 21st, 2013 at 21:52 pm
Time to read it!
March 25th, 2013 at 13:03 pm
Looks to me like one of two things have happened:
1) Your Windows install got corrupted, to the point where a reformat / reinstall was needed to fix the problem.
2) Your motherboard is going bad on you, and needs to be replaced.
I couldn’t tell based on your post what all hardware you replaced. I would venture to guess that your motherboard is failing on you. It’s possible that the SSD was having some problems, and over time it slowly corrupted your OS, to the point where it needed to be reformatted / reinstalled. However, at times, this is actually a residual problem, brought on by an issue higher up in the “power chain”.
You verified that swapping out a PSU didn’t solve the problem. You also verified that RAM was good. Your processor *could* be a factor, but normally CPUs don’t crap out once they’ve been installed and used. If you had overclocked your motherboard (FSB, clock speed, multiplier, etc.) you might be creating one of those problems. But if that wasn’t done either, then I’d venture the motherboard.
The biggest thing to look at on the motherboard is the capacitors. Check to see if you have any popped, bulging, or leaking capacitors. Even ones that are ever so slightly bulging will change the corresponding voltage output to a point where it can start to create issues.
That’s my two cents, anyway…
March 25th, 2013 at 21:31 pm
Now you tell me!! I had replaced PSU, motherboard, and graphics. With little to no change in behaviour. It was all just very random. I would have expected some consistent behaviour if it was software related, but maybe not I guess. It really seemed like memory with how inconsistent it was, but I tested the crap outta that memory.
March 25th, 2013 at 21:34 pm
Hmm….Now since you reformatted and reinstalled, keep an eye on it. If it doesn’t do it again, then you’ve solved the problem.
And if it does still do that, you can check the mobo for popped caps.