ECC vs. Non-ECC
So, I have a question that is a bit different from the 'typical' /r/buildapc post.
I've been thinking about building a PC for workstation use (in particular for processing and analyzing large time series datasets, where 'large' is a minimum of ~50-100 GB and up to a few TB). I've more or less narrowed down my choices to Skylake-X (in particular the 10-core and 14-core variants) and Xeon-W (probably the 10-core variant, though that's TBD). In either case I will probably start out with 128 GB of RAM, though there are some (very expensive) options for Xeon-W that would allow 512 GB.
I mostly understand the main trade-offs here: with Xeon-W I get expanded memory support (128 GB --> 512 GB) and ECC memory (plus maybe one or two other things that I probably won't use) at a ~40% higher cost. From a quick search, ECC memory seems to be even more expensive than normal DDR4 RAM (on Newegg it goes for ~$200/16 GB), and the mobo is likely going to cost more too (Xeon-W doesn't use X299). With Skylake-X you gain overclocking support on both CPU and memory and the CPU/memory is cheaper, but you lose the ECC memory and are limited to 128 GB.
Of these, ECC memory is the only one whose real-world benefit I don't have a good grasp of. I guess my main questions are:
Is ECC and non-ECC performance identical at the same frequency, or is there an additional performance hit from the error correcting? (ofc they aren't at the same frequency, but I want to know if the expected difference would be the same as comparing 2 different non-ECC variants at different frequencies.) I haven't delved into memory timings for ECC DDR4, but that might partially answer this question (since timings and frequency both affect performance).
What is a realistic error rate that I would expect without ECC, particularly with a 128 GB setup (i.e., how often will crashes be avoided by using ECC memory? Is this a 'once in the next 5 years' or 'once a week' type of deal?)
Any other major differences between ECC and non-ECC memory that I am missing?
Or, put in other words, comparing the 10-core variants means that I'm paying ~$1000 more (in total the system cost will probably jump from ~$3000 to ~$4000 after adding a GPU, motherboard, etc.) and sacrificing overclocking, mostly to gain ECC support and possible future memory expansion (memory is too damn expensive right now for me to seriously consider getting 32/64 GB DIMMs, let alone getting 8 of them). Is this worth it?
Thanks in advance.
.
.
.
A few notes (feel free to skip these):
.
NOTE 1: I could see a 128 GB limit being somewhat restrictive for what I want to do. I can do what I need to with this, but I will have to sacrifice some vectorized operations and explicitly split up the data into smaller chunks. That said, Skylake-X is compatible with Intel's Optane tech, which seems like it could add some additional (albeit slower) memory. I wouldn't plan on depending on this for primary memory usage, but I could see it being handy for dealing with memory spikes for certain operations (in effect, I'm thinking of it as a faster pagefile/swap). Anyone have experience with this? Would I be better off just relying on a standard pagefile/swap and a PCIe SSD?
.
NOTE 2: yes, I know about Threadripper.
Threadripper is great for a lot of people, but unfortunately not for my application. This is because 1) I should be able to utilize AVX-512 either immediately or in the near future, and 2) my application is rather sensitive to memory latency but isn't explicitly NUMA aware. Issue #2 is the main reason.
In the past I have run this code on a Sandy Bridge 2-socket Xeon system, and found that using both NUMA nodes only improved performance by ~5% (on average), despite having double the available resources. This was admittedly on an older Linux kernel (from around the time when Sandy Bridge Xeons were new), but I don't trust that the OS (probably Windows 10, depending on if WSL can handle all my Linux/GNU needs) will be able to schedule threads well enough to significantly improve the situation.
.
NOTE 3: I would like to avoid going with the mainstream Xeons for 1 reason: cost. I've considered them, and to me it seems that I would be paying a lot more for really not that much extra functionality (for my usage anyways). I don't need/want a multi-socket design (I want to avoid NUMA, otherwise I would go with Threadripper or Epyc). I don't need most of the extra server-oriented tech. The only real tangible benefit I see is 6 memory channels instead of 4 (which, admittedly, would be nice, but not 'double the cost' nice). I think the vast majority of the other 'nice' mainstream Xeon aspects are available in Xeon-W for much cheaper (except the 6 channel memory, unfortunately).
.
NOTE 4: I don't really care about sacrificing CPU overclocking, since these chips run hot as it is and already have to downclock to run AVX-512. Memory overclocking is a slightly bigger deal, as I would get a minimum of 3200 MHz (and possibly 3600+ MHz) from non-ECC memory versus 2400 MHz or 2666 MHz out of ECC, and my application is at least partially memory bound. ofc timings play a role too, but I would imagine that I would be getting decently better performance from the non-ECC (my gut tells me 10-20%, though that is 100% a guess and not based on real-world testing).
Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the most-common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.
Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state.[2] Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.
Problem background
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.[3] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).[4] As a result, systems operating at high altitudes require special provision for reliability.
As an example, the spacecraft Cassini–Huygens, launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Thanks to built-in EDAC functionality, the spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four for that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9.[5]
There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently—since lower-energy particles will be able to change a memory cell's state.[4] On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies[6] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10⁻¹⁰ error/bit·h (roughly one bit error per hour per gigabyte of memory) to 10⁻¹⁷ error/bit·h (roughly one bit error per millennium per gigabyte of memory).[6][7][8] A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance’09 conference.[7] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (roughly 2.5 × 10⁻¹¹ error/bit·h) and 70,000 (roughly 7 × 10⁻¹¹ error/bit·h, or 5 bit errors per 8 gigabytes of RAM per hour) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
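The unit conversions behind these figures can be checked with a short calculation. The following is a minimal sketch, not a reliability model: it assumes errors are independent and spread uniformly over all bits, treats a megabit as 10⁶ bits, and uses a 128 GiB capacity purely as an illustrative input. The function names are made up for this example.

```python
# Convert published per-bit error rates into expected error counts.
# Assumptions (illustrative only): independent, uniformly distributed errors;
# 1 megabit = 1e6 bits; capacity expressed in GiB.

BITS_PER_GIB = 8 * 2**30
HOURS_PER_YEAR = 24 * 365.25


def per_mbit_to_per_bit(errors_per_billion_hours_per_mbit: float) -> float:
    """Errors per billion device-hours per megabit -> errors per bit per hour."""
    return errors_per_billion_hours_per_mbit / 1e9 / 1e6


def expected_errors_per_hour(rate_per_bit_hour: float, capacity_gib: float) -> float:
    """Expected bit errors per hour for a memory of the given capacity."""
    return rate_per_bit_hour * capacity_gib * BITS_PER_GIB


if __name__ == "__main__":
    rates = [
        ("lowest published rate (1e-17)", 1e-17),
        ("highest published rate (1e-10)", 1e-10),
        ("field study, lower bound", per_mbit_to_per_bit(25_000)),
        ("field study, upper bound", per_mbit_to_per_bit(70_000)),
    ]
    for label, rate in rates:
        per_hour = expected_errors_per_hour(rate, 128)
        per_year = per_hour * HOURS_PER_YEAR
        print(f"{label}: {per_hour:.3g} errors/hour, {per_year:.3g} errors/year")
```

These are averages over a whole fleet. Since the same study found that only somewhat more than 8% of DIMMs were affected in a given year, errors appear to cluster in a minority of modules, and a single healthy machine may see far fewer than the mean suggests.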
The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware causes of machine crashes.[7] Memory errors can cause security vulnerabilities.[7] A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.[9]
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Because of the high cell density in modern memory, accessing data stored in DRAM causes memory cells to leak their charge and interact electrically, altering the content of nearby memory rows that were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.[10][11]
An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character '8' (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the '8' (0011 1000 binary) has silently become a '9' (0011 1001).
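The spreadsheet example can be reproduced in a few lines. This is only an illustration of the arithmetic; `set_bit` is a hypothetical helper that models a bit stuck at 1.

```python
def set_bit(byte: int, position: int) -> int:
    """Return `byte` with the bit at `position` (0 = least significant) forced to 1."""
    return byte | (1 << position)


original = ord("8")               # 56 decimal, 0011 1000 binary
corrupted = set_bit(original, 0)  # lowest bit stuck at 1

print(f"{original:08b} -> {corrupted:08b}: {chr(original)!r} became {chr(corrupted)!r}")
# prints: 00111000 -> 00111001: '8' became '9'
```

A character whose lowest bit is already 1, such as '9', passes through the same faulty cell unchanged, which is part of why a stuck bit can go unnoticed for a long time.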
Solutions
Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most-common error correcting code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
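To make the SECDED idea concrete, here is a toy sketch of an extended Hamming code over 4 data bits (7 Hamming bits plus one overall parity bit). It is not how any particular memory controller is implemented; real ECC DIMMs protect 64 data bits with 8 check bits, but the decoding logic follows the same pattern. The bit layout and function names are choices made for this example.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    word[1] = word[3] ^ word[5] ^ word[7]   # covers positions with bit 0 set
    word[2] = word[3] ^ word[6] ^ word[7]   # covers positions with bit 1 set
    word[4] = word[5] ^ word[6] ^ word[7]   # covers positions with bit 2 set
    word[0] = sum(word[1:]) % 2             # makes the total number of 1s even
    return word


def decode(word: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position            # XOR of the positions of all set bits
    parity_odd = sum(word) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        word = word.copy()
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not correctable.
        return "uncorrectable", None

    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, data
```

For example, flipping any one bit of `encode(0b1011)` still decodes to `("corrected", 11)`, while flipping any two bits yields `("uncorrectable", None)`. Three or more flips can be miscorrected, which is the gap that Chipkill-style schemes aim to close.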
Implementations
Seymour Cray famously said 'parity is for farmers' when asked why he left this out of the CDC 6600.[12] Later, he included parity in the CDC 7600, which caused pundits to remark that 'apparently a lot of farmers buy computers'. The original IBM PC and all PCs until the early 1990s used parity checking.[13] Later ones mostly did not. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.[citation needed]
An ECC-capable memory controller can detect and correct errors of a single bit per 64-bit 'word' (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, macOS, and Windows,[citation needed] allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
Some DRAM chips include 'internal' on-chip error correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory.[14][15] In some systems, a similar effect may be achieved by using EOS memory modules.
Error detection and correction (EDAC) depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, which makes two simultaneous errors in the same word improbable. This used to be the case when memory chips were one bit wide, which was typical in the first half of the 1980s; later developments moved many bits into the same chip, so a single chip failure can corrupt several bits of one word. This weakness is addressed by various technologies, including IBM's Chipkill, Sun Microsystems' Extended ECC, Hewlett Packard's Chipspare, and Intel's Single Device Data Correction (SDDC).
DRAM memory may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also 'scrub' the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove soft errors.
Interleaving allows for distribution of the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.[16]
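The effect of interleaving can be illustrated with a small model. This is a sketch under simplifying assumptions: memory is a single row of cells, a particle strike flips a run of physically adjacent cells, and the two address mappings below are invented for the example rather than taken from any real DRAM layout.

```python
from collections import Counter
from typing import Dict


def errors_per_word(burst_start: int, burst_len: int, num_words: int,
                    word_bits: int, interleaved: bool) -> Dict[int, int]:
    """Count flipped bits per logical word for a burst of adjacent physical cells."""
    hits: Counter = Counter()
    for cell in range(burst_start, burst_start + burst_len):
        if interleaved:
            word = cell % num_words    # neighbouring cells belong to different words
        else:
            word = cell // word_bits   # neighbouring cells belong to the same word
        hits[word] += 1
    return dict(hits)


# A 3-cell burst starting at physical cell 5, with four 8-bit words:
print(errors_per_word(5, 3, num_words=4, word_bits=8, interleaved=False))
# prints: {0: 3}
print(errors_per_word(5, 3, num_words=4, word_bits=8, interleaved=True))
# prints: {1: 1, 2: 1, 3: 1}
```

Without interleaving, all three flips land in one word and exceed what a single-error-correcting code can repair. With interleaving, each affected word sees one flip, which stays within the correction threshold as long as the burst is no longer than the number of interleaved words.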
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy (TMR). The latter is preferred because its hardware is faster than that of the Hamming error-correction scheme.[16] Space satellite systems often use TMR,[17][18][19] although satellite RAM usually uses Hamming error correction.[20]
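The voting step of TMR is simple enough to show directly, which helps explain why its hardware can be fast. The sketch below assumes three stored copies of each value and uses a bitwise majority function; `tmr_vote` is a name chosen for this example.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


value = 0b10110010
upset = value ^ 0b00001000   # one copy suffers a single-bit upset

assert tmr_vote(value, upset, value) == value   # the bad copy is outvoted
```

The vote needs only a few gates per bit and no syndrome decoding, at the price of storing every value three times. It also fails silently if two copies are corrupted in the same bit position.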
Many early implementations of ECC memory mask correctable errors, acting 'as if' the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events.[21]
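On Linux systems with an EDAC driver loaded, per-controller error counters are exposed through sysfs, which makes this kind of monitoring scriptable. The following is a minimal sketch, assuming the conventional `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files are present; on machines without ECC or without a loaded EDAC driver the directory is typically missing and the function returns an empty result. The function name and return shape are choices made for this example.

```python
from pathlib import Path
from typing import Dict


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Return {'mc0': {'ce': ..., 'ue': ...}, ...} for each memory controller found."""
    base = Path(root)
    if not base.is_dir():
        return {}                      # no EDAC support visible on this system
    counts: Dict[str, Dict[str, int]] = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text().strip()),  # corrected errors
                "ue": int((mc / "ue_count").read_text().strip()),  # uncorrected errors
            }
        except (OSError, ValueError):
            continue                   # skip controllers with unreadable counters
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: {c['ce']} corrected, {c['ue']} uncorrected")
```

Polling these counters periodically and watching for a rising corrected-error count is one way to spot a module worth replacing before it produces an uncorrectable error.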
Many ECC memory systems use an 'external' EDAC circuit between the CPU and the memory. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct.[14] Modern desktop and server CPUs integrate the EDAC circuit into the CPU,[22] especially with the shift toward CPU-integrated memory controllers, which are related to the NUMA architecture.
As of 2009, the most-common error-correction codes use Hamming or Hsiao codes that provide single-bit error correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for protecting memory – double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error correcting and double-nibble error detecting (SNC-DND) codes, Reed–Solomon error correction codes, etc. However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes.[23][24]
Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with odd weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area and delay.[25][26][27]
Cache
Many processors use error-correction codes in the on-chip cache, including the Intel Itanium and Xeon[28] processors, the AMD Athlon, Opteron, all Zen-[29] and Zen+-based[30] processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.[23][31]
As of 2006, EDC/ECC and ECC/ECC are the two most-common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache.[32] CPUs that use the EDC/ECC technique always write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.
Registered memory
Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is neither, for economy. However, unbuffered (not-registered) ECC memory is available,[33] and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC.[34] Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.
Advantages and disadvantages
Ultimately, there is a trade-off between protection against unusual loss of data, and a higher cost.
ECC protects against undetected memory data corruption, and is used in computers where such corruption is unacceptable, for example in some scientific and financial computing applications, or in file servers. ECC also reduces the number of crashes, which are especially unacceptable in multi-user server applications and maximum-availability systems. Most motherboards and processors for less critical applications are not designed to support ECC so their prices can be kept lower. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if ECC RAM is installed.
ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive.
ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking.[35] However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.[22][36][37]
References
1. Werner Fischer. 'RAM Revealed'. admin-magazine.com. Retrieved October 20, 2014.
2. 'A survey of techniques for improving error-resilience of DRAM', JSA, 2018.
3. Single Event Upset at Ground Level, Eugene Normand, Member, IEEE, Boeing Defense & Space Group, Seattle, WA 98124-2499.
4. 'A Survey of Techniques for Modeling and Improving Reliability of Computing Systems', IEEE TPDS, 2015.
5. Gary M. Swift and Steven M. Guertin. 'In-Flight Observations of Multiple-Bit Upset in DRAMs'. Jet Propulsion Laboratory.
6. Borucki, 'Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level', 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487.
7. Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). 'DRAM Errors in the Wild: A Large-Scale Field Study' (PDF). SIGMETRICS/Performance. ACM. ISBN 978-1-60558-511-6. Lay summary – ZDNet.
8. 'A Memory Soft Error Measurement on Production Systems'.
9. Li, Huang; Shen, Chu (2010). 'A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility' (PDF). Usenix Annual Tech Conference 2010.
10. Yoongu Kim; Ross Daly; Jeremie Kim; Chris Fallin; Ji Hye Lee; Donghyuk Lee; Chris Wilkerson; Konrad Lai; Onur Mutlu (2014-06-24). 'Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors' (PDF). ece.cmu.edu. IEEE. Retrieved 2015-03-10.
11. Dan Goodin (2015-03-10). 'Cutting-edge hack gives super user status by exploiting DRAM weakness'. Ars Technica. Retrieved 2015-03-10.
12. 'CDC 6600'. Microsoft Research. Retrieved 2011-11-23.
13. 'Parity Checking'. Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
14. A. H. Johnston. 'Space Radiation Effects in Advanced Flash Memories'. Archived 2016-03-04 at the Wayback Machine. NASA Electronic Parts and Packaging Program (NEPP). 2001.
15. 'ECC DRAM – Intelligent Memory'. intelligentmemory.com. Retrieved 2014-12-23.
16. 'Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite'. Tsinghua Space Center, Tsinghua University, Beijing. Archived from the original on 2011-10-02. Retrieved 2009-02-16.
17. 'Actel engineers use triple-module redundancy in new rad-hard FPGA'. Military & Aerospace Electronics. Retrieved 2009-02-16.
18. 'SEU Hardening of Field Programmable Gate Arrays (FPGAs) For Space Applications and Device Characterization'. Klabs.org. 2010-02-03. Archived from the original on 2011-11-25. Retrieved 2011-11-23.
19. 'FPGAs in Space'. Techfocusmedia.net. Retrieved 2011-11-23. [permanent dead link]
20. 'Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment'. Radhome.gsfc.nasa.gov. Retrieved 2011-11-23.
21. Doug Thompson, Mauro Carvalho Chehab. 'EDAC - Error Detection And Correction'. Archived 2009-09-05 at the Wayback Machine. 2005 - 2009. 'The 'edac' kernel module goal is to detect and report errors that occur within the computer system running under linux.'
22. 'AMD-762™ System Controller Software/BIOS Design Guide, p. 179' (PDF).
23. Doe Hyun Yoon; Mattan Erez. 'Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches'. 2009. p. 3.
24. Daniele Rossi; Nicola Timoncini; Michael Spica; Cecilia Metra. 'Error Correcting Code Analysis for Cache Memory High Reliability and Performance'. Archived 2015-02-03 at the Wayback Machine.
25. Shalini Ghosh; Sugato Basu; and Nur A. Touba. 'Selecting Error Correcting Codes to Minimize Power in Memory Checker Circuits'. Archived 2015-02-03 at the Wayback Machine. p. 2 and p. 4.
26. Chris Wilkerson; Alaa R. Alameldeen; Zeshan Chishti; Wei Wu; Dinesh Somasekhar; Shih-lien Lu. 'Reducing cache power with low-cost, multi-bit error-correcting codes'. doi: 10.1145/1816038.1815973.
27. M. Y. Hsiao. 'A Class of Optimal Minimum Odd-weight-column SEC-DED Codes'. 1970.
28. Intel Corporation. 'Intel Xeon Processor E7 Family: Reliability, Availability, and Serviceability'. 2011. p. 12.
29. 'AMD Zen microarchitecture - Memory Hierarchy'. WikiChip. Retrieved 15 October 2018.
30. 'AMD Zen+ microarchitecture - Memory Hierarchy'. WikiChip. Retrieved 15 October 2018.
31. Jangwoo Kim; Nikos Hardavellas; Ken Mai; Babak Falsafi; James C. Hoe. 'Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding'. 2007. p. 2.
32. Nathan N. Sadler and Daniel J. Sorin. 'Choosing an Error Protection Scheme for a Microprocessor’s L1 Data Cache'. 2006. p. 1.
33. 'Typical unbuffered ECC RAM module: Crucial CT25672BA1067'.
34. Specification of desktop motherboard that supports both ECC and non-ECC unbuffered RAM with compatible CPUs.
35. 'Discussion of ECC on pcguide'. Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
36. Benchmark of AMD-762/Athlon platform with and without ECC. Archived 2013-06-15 at the Wayback Machine.
37. 'ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All'. Systems and Network Security Group at VU Amsterdam. Retrieved 2018-11-22.