I've recently acquired an AMD64 box (dual Opteron 242, SiS Master@-FAR motherboard (http://www.msi.com.tw/program/products/server/svr/pro_svr_detail.php?UID=484). See below for more details). I find it very unstable running with 8 GB memory, though 4 GB are not a problem. At first I thought it was the onboard peripherals, but after disabling them it still persisted.
What's unstable? I only once got it through the boot process. Running a 5.3-RELEASE i386 kernel it panics, though I haven't investigated the panic (yet), since I'm not interested in the i386 kernel. The amd64 5.4-PRERELEASE kernel just hangs/freezes. When the peripherals are enabled, it's after probing the onboard NIC (bge) and before probing SATA (no drives present). I've done a verbose boot, of course, but no additional information is present. The NIC is recognized, and that's all.
Without the peripherals, but with a 3Com 3c905 PCI NIC, it continues beyond this point, but doesn't enable the NIC. I don't have dmesg output for these attempts, so I can't produce the exact message, and I suspect it's not important. It continues until trying to mount NFS file systems, where it hangs for obvious reasons. Pressing ^C causes the system to either panic (and be unable to dump because I don't have that much swap) or just hang.
None of these problems occur when I use 4 GB memory. About the only strangeness, which seems to come from the BIOS, is that it recognizes only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB memory.
I realize that this isn't enough to diagnose the problem. The reason for this message now is to ask:
1. Has anybody else seen this problem?
2. Has anybody else used this hardware configuration and *not* seen this problem?
3. Where should I look next?
I'm attaching the (non-verbose) dmesg from a successful boot.
Greg See complete headers for address and phone numbers.
Mar 30 14:17:16 obelix kernel: FreeBSD 5.4-PRERELEASE #0: Tue Mar 22 04:02:17 UTC 2005
Mar 30 14:17:16 obelix kernel: root@obelix:/usr/obj/src/FreeBSD/OBELIX/src/sys/OBELIX
Mar 30 14:17:16 obelix kernel: Timecounter "i8254" frequency 1193182 Hz quality 0
Mar 30 14:17:16 obelix kernel: CPU: AMD Opteron(tm) Processor 242 (1603.65-MHz K8-class CPU)
Mar 30 14:17:16 obelix kernel: Origin = "AuthenticAMD" Id = 0xf5a Stepping = 10
Mar 30 14:17:16 obelix kernel: Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
Mar 30 14:17:16 obelix kernel: AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
Mar 30 14:17:16 obelix kernel: real memory = 3756916736 (3582 MB)
Mar 30 14:17:16 obelix kernel: avail memory = 3623907328 (3456 MB)
Mar 30 14:17:16 obelix kernel: ACPI APIC Table: <VIAK8 AWRDACPI>
Mar 30 14:17:16 obelix kernel: FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
Mar 30 14:17:16 obelix kernel: cpu0 (BSP): APIC ID: 0
Mar 30 14:17:16 obelix kernel: cpu1 (AP): APIC ID: 1
Mar 30 14:17:16 obelix kernel: ioapic0: Changing APIC ID to 2
Mar 30 14:17:16 obelix kernel: ioapic0 <Version 0.3> irqs 0-23 on motherboard
Mar 30 14:17:16 obelix kernel: acpi0: <VIAK8 AWRDACPI> on motherboard
Mar 30 14:17:16 obelix kernel: acpi0: Power Button (fixed)
Mar 30 14:17:16 obelix kernel: Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
Mar 30 14:17:16 obelix kernel: acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
Mar 30 14:17:16 obelix kernel: cpu0: <ACPI CPU> on acpi0
Mar 30 14:17:16 obelix kernel: cpu1: <ACPI CPU> on acpi0
Mar 30 14:17:16 obelix kernel: acpi_button0: <Power Button> on acpi0
Mar 30 14:17:16 obelix kernel: pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
Mar 30 14:17:16 obelix kernel: pci0: <ACPI PCI bus> on pcib0
Mar 30 14:17:16 obelix kernel: pcib1: <PCI-PCI bridge> at device 1.0 on pci0
Mar 30 14:17:16 obelix kernel: pci1: <PCI bus> on pcib1
Mar 30 14:17:16 obelix kernel: pci1: <display, VGA> at device 0.0 (no driver attached)
Mar 30 14:17:16 obelix kernel: xl0: <3Com 3c905C-TX Fast Etherlink XL> port 0xd000-0xd07f mem 0xfb000000-0xfb00007f irq 18 at device 7.0 on pci0
Mar 30 14:17:16 obelix kernel: miibus0: <MII bus> on xl0
Mar 30 14:17:16 obelix kernel: xlphy0: <3c905C 10/100 internal PHY> on miibus0
Mar 30 14:17:16 obelix kernel: xlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
Mar 30 14:17:16 obelix kernel: xl0: Ethernet address: 00:50:da:cf:17:d3
Mar 30 14:17:16 obelix kernel: atapci0: <VIA 8237 UDMA133 controller> port 0xd400-0xd40f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 15.0 on pci0
Mar 30 14:17:16 obelix kernel: ata0: channel #0 on atapci0
Mar 30 14:17:16 obelix kernel: ata1: channel #1 on atapci0
Mar 30 14:17:16 obelix kernel: uhci0: <VIA 83C572 USB controller> port 0xd800-0xd81f irq 21 at device 16.0 on pci0
Mar 30 14:17:16 obelix kernel: usb0: <VIA 83C572 USB controller> on uhci0
Mar 30 14:17:16 obelix kernel: usb0: USB revision 1.0
Mar 30 14:17:16 obelix kernel: uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
Mar 30 14:17:16 obelix kernel: uhub0: 2 ports with 2 removable, self powered
Mar 30 14:17:16 obelix kernel: uhci1: <VIA 83C572 USB controller> port 0xdc00-0xdc1f irq 21 at device 16.1 on pci0
Mar 30 14:17:16 obelix kernel: usb1: <VIA 83C572 USB controller> on uhci1
Mar 30 14:17:16 obelix kernel: usb1: USB revision 1.0
Mar 30 14:17:16 obelix kernel: uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
Mar 30 14:17:16 obelix kernel: uhub1: 2 ports with 2 removable, self powered
Mar 30 14:17:16 obelix kernel: uhci2: <VIA 83C572 USB controller> port 0xe000-0xe01f irq 21 at device 16.2 on pci0
Mar 30 14:17:16 obelix kernel: usb2: <VIA 83C572 USB controller> on uhci2
Mar 30 14:17:16 obelix kernel: usb2: USB revision 1.0
Mar 30 14:17:16 obelix kernel: uhub2: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
Mar 30 14:17:16 obelix kernel: uhub2: 2 ports with 2 removable, self powered
Mar 30 14:17:16 obelix kernel: pci0: <serial bus, USB> at device 16.4 (no driver attached)
Mar 30 14:17:16 obelix kernel: isab0: <PCI-ISA bridge> at device 17.0 on pci0
Mar 30 14:17:16 obelix kernel: isa0: <ISA bus> on isab0
Mar 30 14:17:16 obelix kernel: pci0: <multimedia, audio> at device 17.5 (no driver attached)
Mar 30 14:17:16 obelix kernel: acpi_tz0: <Thermal Zone> on acpi0
Mar 30 14:17:16 obelix kernel: fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
Mar 30 14:17:16 obelix kernel: sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
Mar 30 14:17:16 obelix kernel: sio0: type 16550A
Mar 30 14:17:16 obelix kernel: sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
Mar 30 14:17:16 obelix kernel: sio1: type 16550A
Mar 30 14:17:16 obelix kernel: ppc0: <Standard parallel printer port> port 0x778-0x77b,0x378-0x37f irq 7 on acpi0
Mar 30 14:17:16 obelix kernel: ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
Mar 30 14:17:16 obelix kernel: ppbus0: <Parallel port bus> on ppc0
Mar 30 14:17:16 obelix kernel: plip0: <PLIP network interface> on ppbus0
Mar 30 14:17:16 obelix kernel: lpt0: <Printer> on ppbus0
Mar 30 14:17:16 obelix kernel: lpt0: Interrupt-driven port
Mar 30 14:17:16 obelix kernel: ppi0: <Parallel I/O> on ppbus0
Mar 30 14:17:16 obelix kernel: atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
Mar 30 14:17:16 obelix kernel: atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
Mar 30 14:17:16 obelix kernel: kbd0 at atkbd0
Mar 30 14:17:16 obelix kernel: orm0: <ISA Option ROMs> at iomem 0xd0000-0xd07ff,0xc0000-0xcffff on isa0
Mar 30 14:17:16 obelix kernel: sc0: <System console> at flags 0x100 on isa0
Mar 30 14:17:16 obelix kernel: sc0: VGA <16 virtual consoles, flags=0x300>
Mar 30 14:17:16 obelix kernel: vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Mar 30 14:17:16 obelix kernel: Timecounters tick every 1.000 msec
Mar 30 14:17:16 obelix kernel: ad0: 190782MB <ST3200826A/3.01> [387621/16/63] at ata0-master UDMA100
Mar 30 14:17:16 obelix kernel: ad1: 190782MB <ST3200826A/3.01> [387621/16/63] at ata0-slave UDMA100
Mar 30 14:17:16 obelix kernel: acd0: DVDR <PIONEER DVD-RW DVR-108/1.04> at ata1-master UDMA66
Mar 30 14:17:16 obelix kernel: SMP: AP CPU #1 Launched!
Mar 30 14:17:16 obelix kernel: Mounting root from ufs:/dev/ad0s1a
Greg 'groggy' Lehey wrote:
> I've recently acquired an AMD64 box (dual Opteron 242, SiS Master@-FAR
> motherboard
> (http://www.msi.com.tw/program/products/server/svr/pro_svr_detail.php?UID=484).
> See below for more details). I find it very unstable running with 8
> GB memory, though 4 GB are not a problem. At first I thought it was
> the onboard peripherals, but after disabling them it still persisted.
>
> What's unstable? I only once got it through the boot process.
> Running a 5.3-RELEASE i386 kernel it panics, though I haven't
> investigated the panic (yet), since I'm not interested in the i386
> kernel. The amd64 5.4-PRERELEASE kernel just hangs/freezes. When the
> peripherals are enabled, it's after probing the onboard NIC (bge) and
> before probing SATA (no drives present). I've done a verbose boot, of
> course, but no additional information is present. The NIC is
> recognized, and that's all.
>
> Without the peripherals, but with a 3Com 3c905 PCI NIC, it continues
> beyond this point, but doesn't enable the NIC. I don't have dmesg
> output for these attempts, so I can't produce the exact message, and I
> suspect it's not important. It continues until trying to mount NFS
> file systems, where it hangs for obvious reasons. Pressing ^C causes
> the system to either panic (and be unable to dump because I don't have
> that much swap) or just hang.
>
> None of these problems occur when I use 4 GB memory. About the only
> strangeness, which seems to come from the BIOS, is that it recognizes
> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
> memory.
>
> I realize that this isn't enough to diagnose the problem. The reason
> for this message now is to ask:
>
> 1. Has anybody else seen this problem?
> 2. Has anybody else used this hardware configuration and *not* seen
>    this problem?
> 3. Where should I look next?
>
> I'm attaching the (non-verbose) dmesg from a successful boot.
>
> Greg
> --
> See complete headers for address and phone numbers.
5.3-RELEASE has a lot of problems with >4GB due to busdma issues. Those should no longer be an issue in RELENG_5, including 5.4-PRE. You'll need to dig in and provide some more details, I guess. I have an HDAMA dual Opteron system that behaves fine now with 8GB of RAM, so your problem might lie with particular hardware and/or drivers.
Scott _______________________________________________
On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
> None of these problems occur when I use 4 GB memory. About the only
> strangeness, which seems to come from the BIOS, is that it recognizes
> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
> memory.
>
> I realize that this isn't enough to diagnose the problem. The reason
> for this message now is to ask:
>
> 1. Has anybody else seen this problem?
> 2. Has anybody else used this hardware configuration and *not* seen
>    this problem?
> 3. Where should I look next?
Have you run sysutils/memtest86 with the 8 GB? I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB Reg. ECC DIMMs.
--
Steve
Greg 'groggy' Lehey 31 March 2005 02:45:52
On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
>> None of these problems occur when I use 4 GB memory. About the only
>> strangeness, which seems to come from the BIOS, is that it recognizes
>> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
>> memory.
>>
>> I realize that this isn't enough to diagnose the problem. The reason
>> for this message now is to ask:
>>
>> 1. Has anybody else seen this problem?
>> 2. Has anybody else used this hardware configuration and *not* seen
>>    this problem?
>> 3. Where should I look next?
>
> Have you run sysutils/memtest86 with the 8 GB?

Heh. Difficult when the system doesn't run.

> I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB
> Reg. ECC DIMMs.

OK, this makes sense. It might also explain why the 4 GB configuration only recognizes 3.5 GB.
Greg See complete headers for address and phone numbers.
On Thu, Mar 31, 2005 at 08:14:45AM +0930, Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
> > On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
> >> None of these problems occur when I use 4 GB memory. About the only
> >> strangeness, which seems to come from the BIOS, is that it recognizes
> >> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
> >> memory.
> >>
> >> I realize that this isn't enough to diagnose the problem. The reason
> >> for this message now is to ask:
> >>
> >> 3. Where should I look next?
> >
> > Have you run sysutils/memtest86 with the 8 GB?
>
> Heh. Difficult when the system doesn't run.

That's what happens when 1 of 8 (1 of 4?) DIMM is bad

> > I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB
> > Reg. ECC DIMMs.
>
> OK, this makes sense. It might also explain why the 4 GB
> configuration only recognizes 3.5 GB.

Search amd64 mailing list. The missing memory is reserved for something which escapes me at the moment. Similar to the infamous ISA memory hole.

--
Steve
Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
>
>> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
>>
>>> None of these problems occur when I use 4 GB memory. About the only
>>> strangeness, which seems to come from the BIOS, is that it recognizes
>>> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
>>> memory.
>>>
>>> I realize that this isn't enough to diagnose the problem. The reason
>>> for this message now is to ask:
>>>
>>> 1. Has anybody else seen this problem?
>>> 2. Has anybody else used this hardware configuration and *not* seen
>>>    this problem?
>>> 3. Where should I look next?
>>
>> Have you run sysutils/memtest86 with the 8 GB?
>
> Heh. Difficult when the system doesn't run.
>
>> I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB
>> Reg. ECC DIMMs.
>
> OK, this makes sense. It might also explain why the 4 GB
> configuration only recognizes 3.5 GB.
No, and I'm going to make this an FAQ and post it in a very obvious place, since 4+ GB is so easy to get and people don't seem to understand the PC architecture very well.
Almost all systems put the PCI Memory Mapped IO window into the 3.75-4GB region of the physical memory map. The registers for the APICs and other system resources are also typically in this region. Now with PCI-Express, the Memory Mapped PCI config registers are typically being mapped in the 3.5-3.75GB range. The memory controllers, host bridges, north-bridges, and/or whatever else glues the memory to the bus to the CPU decode these addresses into PCI cycles, not RAM cycles. Some systems are smart and re-map the RAM that is hidden by these holes into a region >4GB. Some systems are dumb, though, and just deny you access to the RAM that is covered up. It's very much like the old days of the XT/AT architecture when you had 1MB of RAM but everything above 640k was hidden by the VGA framebuffer, ISA option ROMs, and system BIOS, but some systems were smart enough to relocate the hidden RAM.
So, your missing .5GB is almost certainly not due to defective RAM; it's just due to The Way Things Are. It's a lot harder for Opteron systems to be smart about this than Xeon systems, since all of the remapping magic can happen in the host bridge on the Xeon, while the Opterons need to have their built-in memory controllers programmed specially for it.
Scott
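The arithmetic behind Scott's explanation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it models a single hole that ends at the 4 GB mark, the function name split_ram and the 3.5 GB hole start are made up for the example, and the only measured number is the "real memory" value from the dmesg in the original message.

```python
GIB = 1 << 30
MIB = 1 << 20

def split_ram(installed, hole_start, remap):
    """Return (bytes usable below 4 GiB, bytes usable above 4 GiB).

    installed  -- bytes of RAM fitted
    hole_start -- physical address where the PCI/APIC window begins;
                  the window is assumed to run up to the 4 GiB mark
    remap      -- True if the chipset relocates hidden RAM above 4 GiB
    """
    four_gib = 4 * GIB
    below = min(installed, hole_start)
    # RAM whose addresses fall inside the window is hidden by it.
    hidden = max(0, min(installed, four_gib) - hole_start)
    above = max(0, installed - four_gib)
    if remap:
        above += hidden  # a "smart" system gives the hidden RAM back
    return below, above

# "real memory" as reported in the posted dmesg (4 GB fitted).
real_memory = 3756916736
print(f"hidden by the hole: {(4 * GIB - real_memory) / MIB:.3f} MiB")

# Hypothetical 3.5 GiB hole start, with and without remapping.
for installed in (4 * GIB, 8 * GIB):
    for remap in (False, True):
        below, above = split_ram(installed, 7 * GIB // 2, remap)
        print(installed // GIB, "GiB fitted, remap =", remap,
              "->", (below + above) / GIB, "GiB usable")
```

With remapping, the 8 GB case in this model ends up with 4.5 GB of RAM above the 4 GB mark; without it, half a gigabyte is simply lost in both cases.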
Greg 'groggy' Lehey 31 March 2005 03:10:16
On Wednesday, 30 March 2005 at 16:04:44 -0700, Scott Long wrote:
> Greg 'groggy' Lehey wrote:
>> On Wednesday, 30 March 2005 at 15:30:37 -0700, Scott Long wrote:
>>
>>> Greg 'groggy' Lehey wrote:
>>>
>>>> I've recently acquired an AMD64 box ...
>>>>
>>>> What's unstable? ... The amd64 5.4-PRERELEASE kernel just
>>>> hangs/freezes.
>>>
>>> 5.3-RELEASE has a lot of problems with >4GB due to busdma issues.
>>> Those should no longer be an issue in RELENG_5, including 5.4-PRE.
>>
>> They appear to be.
>
> I don't understand what you mean here.

As I said above (and trimmed for convenience), this problem occurs on 5.4-PRERELEASE as of yesterday morning. The dmesg shows that too.

>> As I described, it doesn't appear to be the drivers.
>
> I don't see how you proved or disproved this.
Shall I resend the original message? It seems independent of any particular driver. That's not proof, of course, but I didn't claim it was.
Greg See complete headers for address and phone numbers.
Greg 'groggy' Lehey 31 March 2005 03:16:31
On Wednesday, 30 March 2005 at 16:01:14 -0700, Scott Long wrote:
> Greg 'groggy' Lehey wrote:
>> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
>>
>>> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
>>>
>>>> None of these problems occur when I use 4 GB memory. About the only
>>>> strangeness, which seems to come from the BIOS, is that it recognizes
>>>> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
>>>> memory.
>>>
>>> I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB
>>> Reg. ECC DIMMs.
>>
>> OK, this makes sense. It might also explain why the 4 GB
>> configuration only recognizes 3.5 GB.
>
> No, and I'm going to make this an FAQ and post it in a very obvious
> place, since 4+ GB is so easy to get and people don't seem to understand
> the PC architecture very well.
That's not easy to understand when it's barely documented. Thanks for the info: it helps a lot.
This may still be a hint, though: that memory hole doesn't show up during a boot with 8 GB RAM. How come? Is the system trying to map RAM over the PCI hole?
It looks as if I should get a verbose boot listing with 8 GB. It'll be a couple of hours before I find time to reboot this machine. In the meantime, there's a verbose boot with 4 GB at http://www.lemis.com/grog/Images/20050331/obelix-dmesg. I'm told it shows a number of strange things, including incorrect reporting of on-chip cache sizes.
Greg See complete headers for address and phone numbers.
Greg 'groggy' Lehey 31 March 2005 03:18:39
On Wednesday, 30 March 2005 at 14:57:15 -0800, Steve Kargl wrote:
> On Thu, Mar 31, 2005 at 08:14:45AM +0930, Greg 'groggy' Lehey wrote:
>> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
>>> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
>>>> None of these problems occur when I use 4 GB memory. About the only
>>>> strangeness, which seems to come from the BIOS, is that it recognizes
>>>> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
>>>> memory.
>>>>
>>>> I realize that this isn't enough to diagnose the problem. The reason
>>>> for this message now is to ask:
>>>>
>>>> 3. Where should I look next?
>>>
>>> Have you run sysutils/memtest86 with the 8 GB?
>>
>> Heh. Difficult when the system doesn't run.
>
> That's what happens when 1 of 8 (1 of 4?) DIMM is bad
I've booted with the other 2 DIMMs now (I have four 2 GB DIMMs, which is all the motherboard will hold). No problems. See my last reply to Scott: I'm wondering if the system is ignoring the PCI hole.
Greg See complete headers for address and phone numbers.
On Wednesday 30 March 2005 03:15 pm, Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 16:01:14 -0700, Scott Long wrote:
> > Greg 'groggy' Lehey wrote:
> >> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
> >>> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
> >>>> None of these problems occur when I use 4 GB memory. About the
> >>>> only strangeness, which seems to come from the BIOS, is that it
> >>>> recognizes only 3.5 GB. If I put all DIMMS in, it recognizes
> >>>> the full 8 GB memory.
> >>>
> >>> I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700
> >>> 2GB Reg. ECC DIMMs.
> >>
> >> OK, this makes sense. It might also explain why the 4 GB
> >> configuration only recognizes 3.5 GB.
> >
> > No, and I'm going to make this an FAQ and post it in a very obvious
> > place, since 4+ GB is so easy to get and people don't seem to
> > understand the PC architecture very well.
>
> That's not easy to understand when it's barely documented. Thanks
> for the info: it helps a lot.
>
> This may still be a hint, though: that memory hole doesn't show up
> during a boot with 8 GB RAM. How come? Is the system trying to map
> RAM over the PCI hole?

Nope, it's still there. When you boot -v, you'll see the hole in the "Physical memory chunk(s)" list.

However, I suspect that some BIOSes will set the 4GB hole partition in the physical RAM lower, so that there will be 4.5GB of RAM above the 4GB mark. I haven't looked too closely to see for sure.

> It looks as if I should get a verbose boot listing with 8 GB. It'll
> be a couple of hours before I find time to reboot this machine. In
> the meantime, there's a verbose boot with 4 GB at
> http://www.lemis.com/grog/Images/20050331/obelix-dmesg. I'm told it
> shows a number of strange things, including incorrect reporting of
> on-chip cache sizes.

Nope, it is correct. You have 1MB of L2 cache.
L1 data cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L1 instruction cache: 64 kbytes, 64 bytes/line, 1 lines/tag, 2-way associative
L2 unified cache: 1024 kbytes, 64 bytes/line, 1 lines/tag, 16-way associative

> Greg
> --
> See complete headers for address and phone numbers.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5
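Peter's suggestion to look for the hole in the "Physical memory chunk(s)" list amounts to finding the gaps between consecutive address ranges. A minimal Python sketch of that check follows. The helper name find_holes is hypothetical, and the chunk addresses are invented for illustration; they are not taken from Greg's machine or from any real boot log.

```python
def find_holes(chunks):
    """Return the gaps between physical memory chunks.

    chunks -- iterable of (start, end) physical addresses, end inclusive
    """
    holes = []
    prev_end = None
    for start, end in sorted(chunks):
        if prev_end is not None and start > prev_end + 1:
            holes.append((prev_end + 1, start - 1))
        # Track the highest address seen so overlapping chunks are handled.
        prev_end = end if prev_end is None else max(prev_end, end)
    return holes

# Made-up chunk list for a machine whose BIOS remaps RAM above 4 GiB.
chunks = [
    (0x0000000000001000, 0x000000000009bfff),
    (0x0000000000100000, 0x00000000dffeffff),
    (0x0000000100000000, 0x000000021fffffff),
]

for lo, hi in find_holes(chunks):
    print(f"hole 0x{lo:016x} - 0x{hi:016x} ({(hi - lo + 1) >> 20} MiB)")
```

In this invented layout the small gap below 1 MB is the legacy ISA hole Steve mentioned, and the large gap just below 4 GB is the PCI window.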
Ask Bjørn Hansen 31 March 2005 03:24:18
...... Original Message .......
On Thu, 31 Mar 2005 08:14:45 +0930 "Greg 'groggy' Lehey" <grog@FreeBSD.org> wrote:
>> Have you run sysutils/memtest86 with the 8 GB?
>
> Heh. Difficult when the system doesn't run.
There is a bootable ISO version of memtest86 that you could try.
Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 16:01:14 -0700, Scott Long wrote:
>
>> Greg 'groggy' Lehey wrote:
>>
>>> On Wednesday, 30 March 2005 at 14:35:46 -0800, Steve Kargl wrote:
>>>
>>>> On Thu, Mar 31, 2005 at 07:54:39AM +0930, Greg 'groggy' Lehey wrote:
>>>>
>>>>> None of these problems occur when I use 4 GB memory. About the only
>>>>> strangeness, which seems to come from the BIOS, is that it recognizes
>>>>> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB
>>>>> memory.
>>>>
>>>> I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB
>>>> Reg. ECC DIMMs.
>>>
>>> OK, this makes sense. It might also explain why the 4 GB
>>> configuration only recognizes 3.5 GB.
>>
>> No, and I'm going to make this an FAQ and post it in a very obvious
>> place, since 4+ GB is so easy to get and people don't seem to understand
>> the PC architecture very well.
>
> That's not easy to understand when it's barely documented. Thanks for
> the info: it helps a lot.
>
> This may still be a hint, though: that memory hole doesn't show up
> during a boot with 8 GB RAM. How come? Is the system trying to map
> RAM over the PCI hole?
>
> It looks as if I should get a verbose boot listing with 8 GB. It'll
> be a couple of hours before I find time to reboot this machine. In
> the meantime, there's a verbose boot with 4 GB at
> http://www.lemis.com/grog/Images/20050331/obelix-dmesg. I'm told it
> shows a number of strange things, including incorrect reporting of
> on-chip cache sizes.
The SMAP will show the hole. It's well documented in most PC architecture books.
Scott
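To act on Scott's pointer, one could total the usable RAM that the SMAP reports below and above the 4 GB mark. The Python sketch below assumes that a verbose boot prints lines of the form "SMAP type=NN base=... len=..." with hexadecimal fields; that line format is an assumption here and should be checked against an actual boot -v log. The function name usable_ram and the sample lines are invented for illustration. Type 1 is the usable-RAM type in the BIOS E820/SMAP table.

```python
import re

# Assumed line format; verify against a real verbose boot log.
SMAP_RE = re.compile(
    r"SMAP type=([0-9a-fA-F]+) base=([0-9a-fA-F]+) len=([0-9a-fA-F]+)"
)

def usable_ram(lines, boundary=1 << 32):
    """Sum usable (type 1) SMAP ranges below and above `boundary`."""
    below = above = 0
    for line in lines:
        m = SMAP_RE.search(line)
        if not m:
            continue
        typ, base, length = (int(g, 16) for g in m.groups())
        if typ != 1:  # skip reserved, ACPI and other non-RAM ranges
            continue
        end = base + length
        below += max(0, min(end, boundary) - base)
        above += max(0, end - max(base, boundary))
    return below, above

# Invented sample: RAM up to just under 3.5 GiB, 4.5 GiB remapped above 4 GiB.
sample = [
    "SMAP type=01 base=0000000000000000 len=000000000009fc00",
    "SMAP type=02 base=000000000009fc00 len=0000000000000400",
    "SMAP type=01 base=0000000000100000 len=00000000dfef0000",
    "SMAP type=01 base=0000000100000000 len=0000000120000000",
]

below, above = usable_ram(sample)
print(f"below 4 GiB: {below / 2**30:.2f} GiB, above 4 GiB: {above / 2**30:.2f} GiB")
```

If the total above 4 GB is zero on a machine with more than 4 GB fitted, the BIOS is not remapping the hidden RAM.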
Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 16:04:44 -0700, Scott Long wrote:
>
>> Greg 'groggy' Lehey wrote:
>>
>>> On Wednesday, 30 March 2005 at 15:30:37 -0700, Scott Long wrote:
>>>
>>>> Greg 'groggy' Lehey wrote:
>>>>
>>>>> I've recently acquired an AMD64 box ...
>>>>>
>>>>> What's unstable? ... The amd64 5.4-PRERELEASE kernel just
>>>>> hangs/freezes.
>>>>
>>>> 5.3-RELEASE has a lot of problems with >4GB due to busdma issues.
>>>> Those should no longer be an issue in RELENG_5, including 5.4-PRE.
>>>
>>> They appear to be.
>>
>> I don't understand what you mean here.
>
> As I said above (and trimmed for convenience), this problem occurs on
> 5.4-PRERELEASE as of yesterday morning. The dmesg shows that too.

And you're certain that it's due to the same busdma issues that I was describing? I must have missed the evidence that you use to support this.

>>> As I described, it doesn't appear to be the drivers.
>>
>> I don't see how you proved or disproved this.
>
> Shall I resend the original message? It seems independent of any
> particular driver. That's not proof, of course, but I didn't claim it
> was.

Again, I must have missed the part where you investigated the drivers that apply to your particular system. I highly doubt that they apply to every 8GB Opteron system available on the market.

Scott
On Wednesday 30 March 2005 03:09 pm, Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 16:04:44 -0700, Scott Long wrote:
> > Greg 'groggy' Lehey wrote:
> >> On Wednesday, 30 March 2005 at 15:30:37 -0700, Scott Long wrote:
> >>> Greg 'groggy' Lehey wrote:
> >>>> I've recently acquired an AMD64 box ...
> >>>>
> >>> 5.3-RELEASE has a lot of problems with >4GB due to busdma issues.
> >>> Those should no longer be an issue in RELENG_5, including
> >>> 5.4-PRE.
> >>
> >> They appear to be.
> >
> > I don't understand what you mean here.
>
> As I said above (and trimmed for convenience), this problem occurs on
> 5.4-PRERELEASE as of yesterday morning. The dmesg shows that too.
>
> >> As I described, it doesn't appear to be the drivers.
> >
> > I don't see how you proved or disproved this.
>
> Shall I resend the original message? It seems independent of any
> particular driver. That's not proof, of course, but I didn't claim
> it was.

Greg: The busdma problems from 5.3-RELEASE are fixed. That doesn't mean that there are no *other* problems. Scott is saying "the old busdma bug shouldn't be affecting 5.4-PRE", and he's correct.

Most likely, something else is happening, e.g. you're running out of KVM or something silly like that. I know we're right on the brink at 8GB. The layout of the devices may be just enough to tip it over the edge.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5
On Wednesday 30 March 2005 03:22 pm, Ask Bjørn Hansen wrote:
> ...... Original Message .......
> On Thu, 31 Mar 2005 08:14:45 +0930 "Greg 'groggy' Lehey"
> <grog@FreeBSD.org> wrote:
> >> Have you run sysutils/memtest86 with the 8 GB?
> >
> > Heh. Difficult when the system doesn't run.
>
> There is a bootable ISO version of memtest86 that you could try.

That's what the port does. It produces a bootable floppy or ISO.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5
M. Warner Losh 31 March 2005 03:38:22
In message: <20050330231753.GA84137@wantadilla.lemis.com>
"Greg 'groggy' Lehey" <grog@freebsd.org> writes:
: I've booted with the other 2 DIMMs now (I have 4 2 GB DIMMs, all the
: MB will hold). No problems. See my last reply to Scott: I'm
: wondering if the system is ignoring the PCI hole.

Unlikely. If it was, you'd not have enough of a system to complain about.

Warner
On Wednesday, 30 March 2005 at 16:23:34 -0700, Scott Long wrote:
> Greg 'groggy' Lehey wrote:
>> On Wednesday, 30 March 2005 at 16:04:44 -0700, Scott Long wrote:
>>> Greg 'groggy' Lehey wrote:
>>>> On Wednesday, 30 March 2005 at 15:30:37 -0700, Scott Long wrote:
>>>>> Greg 'groggy' Lehey wrote:
>>>>>> I've recently acquired an AMD64 box ...
>>>>>>
>>>>>> What's unstable? ... The amd64 5.4-PRERELEASE kernel just
>>>>>> hangs/freezes.
>>>>>
>>>>> 5.3-RELEASE has a lot of problems with >4GB due to busdma issues.
>>>>> Those should no longer be an issue in RELENG_5, including 5.4-PRE.
>>>>
>>>> They appear to be.
>>>
>>> I don't understand what you mean here.
>>
>> As I said above (and trimmed for convenience), this problem occurs on
>> 5.4-PRERELEASE as of yesterday morning. The dmesg shows that too.
>
> And you're certain that it's due to the same busdma issues that I
> was describing?

No.

> I must have missed the evidence that you use to support this.

I didn't give any. It appears that I misunderstood what you were saying.

>>>> As I described, it doesn't appear to be the drivers.
>>>
>>> I don't see how you proved or disproved this.
>>
>> Shall I resend the original message? It seems independent of any
>> particular driver. That's not proof, of course, but I didn't claim it
>> was.
>
> Again, I must have missed the part where you investigated the drivers
> that apply to your particular system.

The description is still there.

> I highly doubt that they apply to every 8GB Opteron system available
> on the market.

I never suggested that they did. There's every reason to believe that it's something to do with this particular motherboard, but that doesn't mean that FreeBSD is blameless.

Greg
--
When replying to this message, please take care not to mutilate the original text.
For more information, see http://www.lemis.com/email.html
See complete headers for address and phone numbers.
Greg 'groggy' Lehey 31 March 2005 03:40:11
On Wednesday, 30 March 2005 at 15:25:37 -0800, Peter Wemm wrote:
> On Wednesday 30 March 2005 03:09 pm, Greg 'groggy' Lehey wrote:
>> On Wednesday, 30 March 2005 at 16:04:44 -0700, Scott Long wrote:
>>> Greg 'groggy' Lehey wrote:
>>>> As I described, it doesn't appear to be the drivers.
>>>
>>> I don't see how you proved or disproved this.
>>
>> Shall I resend the original message? It seems independent of any
>> particular driver. That's not proof, of course, but I didn't claim
>> it was.
>
> Greg: The busdma problems from 5.3-RELEASE are fixed. That doesn't
> mean that there are no *other* problems. Scott is saying "the old
> busdma bug shouldn't be affecting 5.4-PRE", and he's correct.

Yes, now I understand.

> Most likely, something else is happening, eg: you're running out of KVM
> or something silly like that. I know we're right on the brink at 8GB.
> The layout of the devices may be just enough to tip it over the edge.
Yes, this seems reasonable. Where should I look next? I'm currently rebuilding world and will attempt a verbose boot via serial console when it's done. Anything else I should try?
Greg See complete headers for address and phone numbers.
-- Daniel O'Connor software and network engineer for Genesis Software - http://www.gsoft.com.au "The nice thing about standards is that there are so many of them to choose from." -- Andrew Tanenbaum GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
On Thu, Mar 31, 2005 at 10:32:33AM +0930, Daniel O'Connor wrote:
> On Thu, 31 Mar 2005 08:14, Greg 'groggy' Lehey wrote:
> > > Have you run sysutils/memtest86 with the 8 GB?
> >
Daniel O'Connor 31 March 2005 05:20:04
On Thu, 31 Mar 2005 10:40, Steve Kargl wrote:
> On Thu, Mar 31, 2005 at 10:32:33AM +0930, Daniel O'Connor wrote:
> > On Thu, 31 Mar 2005 08:14, Greg 'groggy' Lehey wrote:
> > > > Have you run sysutils/memtest86 with the 8 GB?
> > >
-- Daniel O'Connor software and network engineer for Genesis Software - http://www.gsoft.com.au "The nice thing about standards is that there are so many of them to choose from." -- Andrew Tanenbaum GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
Greg 'groggy' Lehey 31 March 2005 05:55:52
On Thursday, 31 March 2005 at 10:32:33 +0930, Daniel O'Connor wrote:
> On Thu, 31 Mar 2005 08:14, Greg 'groggy' Lehey wrote:
>>> Have you run sysutils/memtest86 with the 8 GB?
>>

I'm pretty sure it's not the memory. I've tried each pair individually, and it's only when they're both in there together that it's a problem. And yes, I've tried them in each pair of slots.

> bge0: <Broadcom BCM5705 Gigabit Ethernet, ASIC rev. 0x3003> mem 0xfa000000-0xfa00ffff irq 16 at device 11.0 on pci0
> bge0: Reserved 0x10000 bytes for rid 0x10 type 3 at 0xfa000000
They're identical in each probe.
Greg See complete headers for address and phone numbers.
Daniel O'Connor 31 March 2005 06:01:02
On Thu, 31 Mar 2005 11:24, Greg 'groggy' Lehey wrote:> I'm pretty sure it's not the memory. I've tried each pair> individually, and it's only when they're both in there together that> it's a problem. And yes, I've tried them in each pair of slots.
Could be a marginal timing issue. You could try relaxing the RAM timings slightly.
Matthias Buelow 31 March 2005 06:53:52
Greg 'groggy' Lehey wrote:
I'm pretty sure it's not the memory. I've tried each pair>individually, and it's only when they're both in there together that>it's a problem. And yes, I've tried them in each pair of slots.
I'm sure you have checked this as well, but just for completeness: they aren't different pairs? Like one pair being single-sided and the other double-sided (I had some nasty and obscure problems with such a combination myself)?
Greg 'groggy' Lehey 31 March 2005 07:27:10
On Thursday, 31 March 2005 at 5:54:17 +0200, Matthias Buelow wrote:> Greg 'groggy' Lehey wrote:>
I'm pretty sure it's not the memory. I've tried each pair>> individually, and it's only when they're both in there together that>> it's a problem. And yes, I've tried them in each pair of slots.>
I'm sure you have checked this aswell but just for completeness,> they aren't different pairs? Like one pair is single-sided and the> other double-sided (had some nasty and obscure problems with such> a combination myself)?
No, they're all the same.
Greg See complete headers for address and phone numbers.
This shows that in the - case the APIC is broken somehow (0.0 isn't a valid I/O APIC version). It would seem that the system has mapped RAM over top of the I/O APIC perhaps? It would be interesting to see the contents of your MADT to see if it's trying to use a 64-bit PA for your APIC. The local APIC portion seems ok though.
This shows that in the - case the APIC is broken somehow (0.0 isn't a>>valid I/O APIC version). >
You mean the + case, I suppose. Yes, that's what I suspected.>
It would seem that the system has mapped RAM over top of the I/O>>APIC perhaps?>
That's what I suspected too, but imp doesn't think so.>
I'd be more inclined to believe that there is an erroneous mapping by the OS, not that things are fundamentally broken in hardware. Your SMAP table shows everything correctly. It's becoming hard to break through your preconceived notions here and explain how things actually work.
It would be interesting to see the contents of your MADT to see if>>it's trying to use a 64-bit PA for your APIC.>
Any suggestions about how to do so?>
man acpidump
Greg 'groggy' Lehey 31 March 2005 09:15:39
[gratuitous empty lines removed]
On Wednesday, 30 March 2005 at 21:28:36 -0700, Scott Long wrote:> Greg 'groggy' Lehey wrote:>> On Wednesday, 30 March 2005 at 23:01:03 -0500, John Baldwin wrote:>>
On Mar 30, 2005, at 8:54 PM, Greg 'groggy' Lehey wrote:>>>
This shows that in the - case the APIC is broken somehow (0.0 isn't a>>> valid I/O APIC version).>>
You mean the + case, I suppose. Yes, that's what I suspected.>>
It would seem that the system has mapped RAM over top of the I/O>>> APIC perhaps?>>
That's what I suspected too, but imp doesn't think so.>
I'd be more inclined to believe that there is an erroneous mapping> by the OS, not that things are fundamentally broken in hardware.
Agreed. This has been my favourite hypothesis all along. But isn't that what jhb is saying?
Your SMAP table shows everything correctly. It's becoming hard to> break through your pre-concieved notions here and explain how things> actually work.
No, there's nothing to break through. I think you're just having problems
1. expressing yourself, and 2. understanding what I'm saying.
I have no preconceived notions. All I can see here is an antagonistic attitude on your part. What's the problem? You'll recall from my first message that I asked for suggestions about how to approach the issue. jhb provided some; you haven't so far. From what you've written, it's unclear whether you disagree with jhb or not. If you do, why? If you don't, what's your point here?
It would be interesting to see the contents of your MADT to see if>>> it's trying to use a 64-bit PA for your APIC.>>
Any suggestions about how to do so?>
man acpidump
How do you run that on a system that won't boot?
Greg See complete headers for address and phone numbers.
On 03/30/05 23:14, Greg 'groggy' Lehey wrote:
> On Wednesday, 30 March 2005 at 21:28:36 -0700, Scott Long wrote:
>>Greg 'groggy' Lehey wrote:
>>>On Wednesday, 30 March 2005 at 23:01:03 -0500, John Baldwin wrote:
>>>>On Mar 30, 2005, at 8:54 PM, Greg 'groggy' Lehey wrote:
>>>>>>lapic0: LINT1 trigger: edge
>>>>>>lapic0: LINT1 polarity: high
>>>>>>lapic1: Routing NMI -> LINT1
>>>>>>lapic1: LINT1 trigger: edge
>>>>>>lapic1: LINT1 polarity: high
>>>>>>-ioapic0 <Version 0.3> irqs 0-23 on motherboard
>>>>>>+ioapic0 <Version 0.0> irqs 0-23 on motherboard
>>>>>>cpu0 BSP:
>>>>>> ID: 0x00000000 VER: 0x00040010 LDR: 0x01000000 DFR: 0x0fffffff
>>>>>>lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
>>>>
This shows that in the - case the APIC is broken somehow (0.0 isn't a>>>>valid I/O APIC version).>>>
You mean the + case, I suppose. Yes, that's what I suspected.>>>
It would seem that the system has mapped RAM over top of the I/O>>>>APIC perhaps?>>>
That's what I suspected too, but imp doesn't think so.>>
I'd be more inclined to believe that there is an erroneous mapping>>by the OS, not that things are fundamentally broken in hardware.>
Agreed. This has been my favourite hypothesis all along. But isn't> that what jhb is saying?>
Your SMAP table shows everything correctly. It's becoming hard to>>break through your pre-concieved notions here and explain how things>>actually work.>
No, there's nothing to break through. I think you're just having> problems>
1. expressing yourself, and> 2. understanding what I'm saying.>
I have no preconceived notions. All I can see here is an antagonistic> attitude on your part. What's the problem? You'll recall from my> first message that I asked for suggestions about how to approach the> issue. jhb provided some; you haven't so far. From what you've> written, it's unclear whether you disagree with jhb or not. If you> do, why? If you don't, what's your point here?>
It would be interesting to see the contents of your MADT to see if>>>>it's trying to use a 64-bit PA for your APIC.>>>
Any suggestions about how to do so?>>
man acpidump>
How do you run that on a system that won't boot?
You said the system worked with 4 GB (albeit detecting only 3.5 GB). My perception of this whole ACPI thing is that it is fixed in your BIOS (although it can be overridden by the OS). As such, the amount of RAM you have in the machine shouldn't change acpidump results. Is that not correct?
Jon
This shows that in the - case the APIC is broken somehow (0.0 isn't a>>>>> valid I/O APIC version).>>>>
You mean the + case, I suppose. Yes, that's what I suspected.>>>>
It would seem that the system has mapped RAM over top of the I/O>>>>> APIC perhaps?>>>>
That's what I suspected too, but imp doesn't think so.>>>
I'd be more inclined to believe that there is an erroneous mapping>>> by the OS, not that things are fundamentally broken in hardware.>>
Agreed. This has been my favourite hypothesis all along. But isn't>> that what jhb is saying?>>
Your SMAP table shows everything correctly. It's becoming hard to>>> break through your pre-concieved notions here and explain how things>>> actually work.>>
No, there's nothing to break through. I think you're just having>> problems>>
1. expressing yourself, and>> 2. understanding what I'm saying.>>
I have no preconceived notions. All I can see here is an antagonistic>> attitude on your part. What's the problem? You'll recall from my>> first message that I asked for suggestions about how to approach the>> issue. jhb provided some; you haven't so far. From what you've>> written, it's unclear whether you disagree with jhb or not. If you>> do, why? If you don't, what's your point here?>>
It would be interesting to see the contents of your MADT to see if>>>>> it's trying to use a 64-bit PA for your APIC.>>>>
Any suggestions about how to do so?>>>
man acpidump>>
How do you run that on a system that won't boot?>
You said the system worked with 4 GB (albeit detecting only 3.5 GB). My > perception of this whole ACPI thing is that it is fixed in your BIOS > (although it can be overridden by the OS). As such, the amount of RAM > you have in the machine shouldn't change acpidump results. Is that not > correct?>
Jon
This is absolutely correct.
Scott
Greg 'groggy' Lehey 31 March 2005 09:49:37
On Wednesday, 30 March 2005 at 22:27:43 -0700, Scott Long wrote:> Jon Noack wrote:>> On 03/30/05 23:14, Greg 'groggy' Lehey wrote:>>> On Wednesday, 30 March 2005 at 21:28:36 -0700, Scott Long wrote:>>>> Greg 'groggy' Lehey wrote:>>>>> On Wednesday, 30 March 2005 at 23:01:03 -0500, John Baldwin wrote:>>>>>> It would be interesting to see the contents of your MADT to see if>>>>>> it's trying to use a 64-bit PA for your APIC.>>>>>
Any suggestions about how to do so?>>>>
man acpidump>>>
How do you run that on a system that won't boot?>>
You said the system worked with 4 GB (albeit detecting only 3.5>> GB).
Yes, this is correct. A number of people have explained why it only detected 3.5 GB in this configuration.
My perception of this whole ACPI thing is that it is fixed in your>> BIOS (although it can be overridden by the OS). As such, the>> amount of RAM you have in the machine shouldn't change acpidump>> results. Is that not correct?>
This is absolutely correct.
Ah, so you meant to say that the output from the system running with 4 GB memory is useful? That wasn't in the man page you pointed to. What it does say is:
When invoked with the -t flag, the acpidump utility dumps contents of> the following tables:>
... MADT
This may be the case, but between man page and output some terminology must have changed. I can't see any reference to anything like an MADT there. Does that mean that there isn't one, or that ACPI can't find it, or does the section APIC refer to/dump the MADT? Here's the complete output of acpidump -t, anyway:
Since I don't know anything about ACPI, this doesn't say too much to me. Suggestions welcome. If the APIC section is the MADT, it looks as if we should update the docco.
Greg -- See complete headers for address and phone numbers.
On 03/30/05 23:49, Greg 'groggy' Lehey wrote:> On Wednesday, 30 March 2005 at 22:27:43 -0700, Scott Long wrote:>>Jon Noack wrote:>>>On 03/30/05 23:14, Greg 'groggy' Lehey wrote:>>>>On Wednesday, 30 March 2005 at 21:28:36 -0700, Scott Long wrote:>>>>>Greg 'groggy' Lehey wrote:>>>>>>On Wednesday, 30 March 2005 at 23:01:03 -0500, John Baldwin wrote:>>>>>>>It would be interesting to see the contents of your MADT to see if>>>>>>>it's trying to use a 64-bit PA for your APIC.>>>>>>
Any suggestions about how to do so?>>>>>
man acpidump>>>>
How do you run that on a system that won't boot?>>>
You said the system worked with 4 GB (albeit detecting only 3.5>>>GB).>
Yes, this is correct. A number of people have explained why it only> detected 3.5 GB in this configuration.>
My perception of this whole ACPI thing is that it is fixed in your>>>BIOS (although it can be overridden by the OS). As such, the>>>amount of RAM you have in the machine shouldn't change acpidump>>>results. Is that not correct?>>
This is absolutely correct.>
Ah, so you meant to say that the output from the system running with 4> GB memory is useful? That wasn't in the man page you pointed to.> What it does say is:>
When invoked with the -t flag, the acpidump utility dumps contents of>>the following tables:>>
... MADT>
This may be the case, but between man page and output some terminology> must have changed. I can't see any reference to anything like an MADT> there. Does that mean that there isn't one, or that ACPI can't find> it, or does the section APIC refer to/dump the MADT? Here's the> complete output of acpidump -t, anyway:>
<snip acpidump output>>
Since I don't know anything about ACPI, this doesn't say too much to> me. Suggestions welcome. If the APIC section is the MADT, it looks> as if we should update the docco.
Greg 'groggy' Lehey 31 March 2005 10:18:45
On Thursday, 31 March 2005 at 0:00:22 -0600, Jon Noack wrote:> On 03/30/05 23:49, Greg 'groggy' Lehey wrote:>> Here's the complete output of acpidump -t, anyway:>>
<snip acpidump output>>>
Since I don't know anything about ACPI, this doesn't say too much to>> me. Suggestions welcome. If the APIC section is the MADT, it looks>> as if we should update the docco.>
Since we are discussing AMD64 with 8GB RAM, I would also like to point out my problem.
I'm still looking for a way to run FreeBSD 5.3-STABLE with more than 4GB RAM on a dual amd64 2.2GHz machine (IBM @server 325) with a ServeRAID 6M (ips driver). Right now I'm using only 4GB RAM, and this server is in production.
As Scott said a few months ago, the problem is as follows:
"The ips driver looks like it will fail under heavy load when more than 4GB of RAM is present. It tries to force busdma to not defer requests when the bounce page reserve is low, but that looks to be broken and will result in corrupted commands."
Are the ips driver and bus_dma problems fixed yet in the STABLE tree? Is it worth trying a source update to see how it works? I'm afraid to do so, since it is a production server.
Since we are discussing AMD64 with 8GB RAM, I also would like to point my> problem.>
I'm still looking for possibility to run FreeBSD 5.3-STABLE with more> than 4GB RAM > on Dual amd64 2.2GHz machine (IBM @server 325) with ServeRAID 6M (ips> driver)).> Right now I'm using only 4GB RAM and this server is in production.>
As Scott said a few months ago, problem is below:>
"The ips driver looks like it will fail under heavy load when more> than 4GB > of RAM is present. It tries to force busdma to not defer requests> when the > bounce page reserve is low, but that looks to be broken and> will result in corrupted commands.">
Are the ips driver and bus_dma problems fixed yet in STABLE tree? > Is it worth to try source update and see how it works? I'm afraid to do> so, since it is production server.>
Yes, I (hopefully) fixed the problems that I pointed out, and I also locked it and added crashdump support. It is reported to be stable and fast now, so you won't go wrong by updating. These changes are in both 5-stable and 6-current.
Scott
On Wednesday 30 March 2005 09:49 pm, Greg 'groggy' Lehey wrote:> On Wednesday, 30 March 2005 at 22:27:43 -0700, Scott Long wrote:> > Jon Noack wrote:> >> On 03/30/05 23:14, Greg 'groggy' Lehey wrote:> >>> On Wednesday, 30 March 2005 at 21:28:36 -0700, Scott Long wrote:> >>>> Greg 'groggy' Lehey wrote:> >>>>> On Wednesday, 30 March 2005 at 23:01:03 -0500, John Baldwin wrote:> >>>>>> It would be interesting to see the contents of your MADT to see if> >>>>>> it's trying to use a 64-bit PA for your APIC.> >>>>>
Any suggestions about how to do so?> >>>>
man acpidump> >>>
How do you run that on a system that won't boot?> >>
You said the system worked with 4 GB (albeit detecting only 3.5> >> GB).>
Yes, this is correct. A number of people have explained why it only> detected 3.5 GB in this configuration.>
You're also being confused by the implementation of the 'real memory' report. If you take a 30 second glance at the code, you'll see that it is reporting the same units that the hw.maxmem tunable uses. ie: it is the LIMIT or Highest Address that the system has, not the sum total of all the parts.
eg: see the machdep.c comment next to the printf:

	 * Maxmem isn't the "maximum memory", it's one larger than the
	 * highest page of the physical address space. It should be
	 * called something like "Maxphyspage". We may adjust this
	 * based on ``hw.physmem'' and the results of the memory test.
The SMAP lines are what you need to pay attention to. In the output you posted with 8G, you can see the 4GB going from the 4->8GB range, exactly. SMAP type 1 is "usable memory".
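To make the distinction concrete, here is a minimal sketch in Python. The SMAP entries are hypothetical: they only imitate a machine with a reserved hole below 4 GB and a second bank of RAM at 4-8 GB, and are not the values from the posted boot log. The function names are made up for illustration. The sketch contrasts a "one past the highest usable address" figure with the sum of the usable regions.

```python
# Sketch: "highest address" versus "sum of usable regions" in a BIOS memory map.
# The SMAP entries below are HYPOTHETICAL, not taken from the posted dmesg.
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

SMAP_USABLE = 1  # SMAP type 1 is "usable memory"

# Each entry is (base, length, type).
SMAP = [
    (0, 640 * KB, 1),            # conventional memory
    (1 * MB, 3583 * MB, 1),      # RAM from 1 MB up to 3.5 GB
    (3584 * MB, 512 * MB, 2),    # reserved hole below 4 GB (not usable RAM)
    (4 * GB, 4 * GB, 1),         # RAM mapped at 4-8 GB
]


def highest_usable_address(smap):
    """One past the highest usable byte: a limit, not an amount."""
    return max(base + length for base, length, kind in smap if kind == SMAP_USABLE)


def total_usable(smap):
    """Sum of the usable regions: the amount of RAM the OS can actually use."""
    return sum(length for _, length, kind in smap if kind == SMAP_USABLE)


if __name__ == "__main__":
    print("limit:", highest_usable_address(SMAP) // MB, "MB")
    print("total:", total_usable(SMAP) // MB, "MB")
```

With a hole in the map, the limit is always larger than the total, which is why the two figures should not be compared directly.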
This shows that in the - case the APIC is broken somehow (0.0 isn't a>> valid I/O APIC version).>
You mean the + case, I suppose. Yes, that's what I suspected.>
It would seem that the system has mapped RAM over top of the I/O>> APIC perhaps?>
That's what I suspected too, but imp doesn't think so.
Actually, if the full version register were zero, it would not have had 24 IRQs (irqs 0-23 part), so I'm not sure what it is doing. 0.3 isn't really a valid APIC version AFAIK either, though I'm more familiar with the versions used in Intel APICs (usually 1.1, 1.2, or 2.0).
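For reference, a small sketch of how such a version register is commonly decoded. It assumes the register layout of the Intel 82093AA-style I/O APIC (version in bits 0-7, maximum redirection entry in bits 16-23) and assumes the probe line shows the version byte as high nibble, dot, low nibble. The sample register values and function names are made up for illustration; the real register contents were not posted in the thread.

```python
# Sketch: decoding an I/O APIC version register.
# ASSUMPTIONS: 82093AA-style layout; version printed as <high nibble>.<low nibble>.
# The register values used below are invented examples.

def decode_ioapic_version(reg):
    """Return (major, minor, number_of_irqs) for a 32-bit version register."""
    version = reg & 0xFF              # bits 0-7: version
    max_redir = (reg >> 16) & 0xFF    # bits 16-23: index of last redirection entry
    return version >> 4, version & 0xF, max_redir + 1


def format_probe_line(reg):
    """Build a probe line in the style seen in the boot messages."""
    major, minor, irqs = decode_ioapic_version(reg)
    return f"ioapic0 <Version {major}.{minor}> irqs 0-{irqs - 1} on motherboard"


if __name__ == "__main__":
    for reg in (0x00170003, 0x00170000, 0x00000000):
        print(hex(reg), "->", format_probe_line(reg))
```

Under these assumptions a register that reads entirely as zero would yield a single IRQ, whereas "Version 0.0" together with "irqs 0-23" implies that the upper bits were still read back correctly.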
It would be interesting to see the contents of your MADT to see if>> it's trying to use a 64-bit PA for your APIC.>
Any suggestions about how to do so?
Boot with 4g or boot an i386 version and get acpidump -t output.
On Mar 31, 2005, at 12:49 AM, Greg 'groggy' Lehey wrote:
This may be the case, but between man page and output some terminology> must have changed. I can't see any reference to anything like an MADT> there. Does that mean that there isn't one, or that ACPI can't find> it, or does the section APIC refer to/dump the MADT? Here's the> complete output of acpidump -t, anyway:
MADT is the name of the table (Multiple APIC Descriptor Table or some such), but "APIC" is the 4 character signature of the MADT, hence seeing 'APIC' output from acpidump -t when looking at the MADT. Similarly, the MP Table is known as the MP Table, but the signature for the table that you search for in the BIOS is "_MP_".
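The name-versus-signature point can be sketched in Python as follows. The 36-byte header layout is the standard ACPI system description table header; the helper names, the table lengths and the zeroed fields in the fake header are assumptions for illustration only, and the OEM strings are merely copied from the boot message as filler.

```python
import struct

# Standard ACPI table header, 36 bytes, little-endian:
# signature[4], length, revision, checksum, OEM ID[6], OEM table ID[8],
# OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sIII")

# Signature -> common name. Only the table discussed here is listed.
TABLE_NAMES = {b"APIC": "MADT"}


def describe_table(raw):
    """Return (common name, 4-character signature, table length)."""
    fields = ACPI_HEADER.unpack_from(raw)
    signature, length = fields[0], fields[1]
    name = TABLE_NAMES.get(signature, signature.decode("ascii"))
    return name, signature.decode("ascii"), length


def make_fake_header(signature, length):
    """Build an illustrative header; revision, checksum and IDs are placeholders."""
    return ACPI_HEADER.pack(signature, length, 1, 0, b"VIAK8 ", b"AWRDACPI", 0, 0, 0)


if __name__ == "__main__":
    print(describe_table(make_fake_header(b"APIC", 90)))
```

A tool that labels its output by signature will therefore show "APIC" for the table that the documentation calls the MADT.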
Nothing strange here, and it is giving a 64-bit PA for the I/O APIC, albeit one that is < 4GB. One thing to verify is that the physical addresses listed here for the APICs (0xfec00000 and 0xfee00000) aren't included in the SMAP as valid RAM addresses in both cases. It might be useful to boot an i386 CD with 8GB in the machine to see if the MADT looks any different in that case.
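The check suggested above could be sketched like this in Python. The two APIC addresses are the ones quoted in this message; both memory maps are hypothetical examples (one sane, one deliberately broken), and the function name is invented for illustration.

```python
# Sketch: do the APIC physical addresses fall inside a region marked as RAM?
# The SMAP tables are HYPOTHETICAL; only the two APIC addresses come from the thread.
MB, GB = 1 << 20, 1 << 30
SMAP_USABLE = 1  # SMAP type 1 is "usable memory"

APIC_ADDRESSES = {"I/O APIC": 0xFEC00000, "local APIC": 0xFEE00000}

# Each entry is (base, length, type).
SANE_SMAP = [
    (1 * MB, 3583 * MB, 1),      # RAM up to 3.5 GB
    (3584 * MB, 512 * MB, 2),    # reserved: PCI, APICs, and so on
    (4 * GB, 4 * GB, 1),         # RAM at 4-8 GB
]
BROKEN_SMAP = [
    (1 * MB, 4 * GB - 1 * MB, 1),  # RAM claimed all the way up to 4 GB
    (4 * GB, 4 * GB, 1),
]


def ram_region_containing(addr, smap):
    """Return the (base, length) of the usable region covering addr, or None."""
    for base, length, kind in smap:
        if kind == SMAP_USABLE and base <= addr < base + length:
            return base, length
    return None


if __name__ == "__main__":
    for label, smap in (("sane", SANE_SMAP), ("broken", BROKEN_SMAP)):
        for name, addr in APIC_ADDRESSES.items():
            hit = ram_region_containing(addr, smap)
            print(f"{label}: {name} at {addr:#x} ->", "overlaps RAM" if hit else "ok")
```

Running the same check against the real SMAP lines from the 4 GB and 8 GB boots would show whether either map claims the APIC addresses as RAM.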
This shows that in the - case the APIC is broken somehow (0.0 >>>>>> isn't a>>>>>> valid I/O APIC version).>>>>>
You mean the + case, I suppose. Yes, that's what I suspected.>>>>>
It would seem that the system has mapped RAM over top of the I/O>>>>>> APIC perhaps?>>>>>
That's what I suspected too, but imp doesn't think so.>>>>
I'd be more inclined to believe that there is an erroneous mapping>>>> by the OS, not that things are fundamentally broken in hardware.>>>
Agreed. This has been my favourite hypothesis all along. But isn't>>> that what jhb is saying?>>>
Your SMAP table shows everything correctly. It's becoming hard to>>>> break through your pre-concieved notions here and explain how things>>>> actually work.>>>
No, there's nothing to break through. I think you're just having>>> problems>>>
1. expressing yourself, and>>> 2. understanding what I'm saying.>>>
I have no preconceived notions. All I can see here is an >>> antagonistic>>> attitude on your part. What's the problem? You'll recall from my>>> first message that I asked for suggestions about how to approach the>>> issue. jhb provided some; you haven't so far. From what you've>>> written, it's unclear whether you disagree with jhb or not. If you>>> do, why? If you don't, what's your point here?>>>
It would be interesting to see the contents of your MADT to see if>>>>>> it's trying to use a 64-bit PA for your APIC.>>>>>
Any suggestions about how to do so?>>>>
man acpidump>>>
How do you run that on a system that won't boot?>> You said the system worked with 4 GB (albeit detecting only 3.5 GB). >> My perception of this whole ACPI thing is that it is fixed in your >> BIOS (although it can be overridden by the OS). As such, the amount >> of RAM you have in the machine shouldn't change acpidump results. Is >> that not correct?>> Jon>
This is absolutely correct.
It might though. Notice the change in APIC version with 4GB of RAM vs 8GB. The APIC hardware is the same, so that's already indicative of something fishy going on. I think that his APIC address is correct though as otherwise no interrupts at all would work and it wouldn't claim to have 24 IRQs on the APIC in both cases. One can always boot an i386 non-PAE kernel with 8GB in the machine and get an acpidump though.
On Wed, Mar 30, 2005 at 03:25:37PM -0800, Peter Wemm wrote:> Greg: The busdma problems from 5.3-RELEASE are fixed. That doesn't > mean that there are no *other* problems. Scott is saying "the old > busdma bug shouldn't be affecting 5.4-PRE", and he's correct.>
Most likely, something else is happening, eg: you're running out of KVM > or something silly like that. I know we're right on the brink at 8GB. > The layout of the devices may be just enough to tip it over the edge.
Grog's motherboard is a 4+0 configuration, which would mean he is using (or trying to use) 2GB DIMMs. There are memory bus loading specifications that he may be exceeding.
-- -- David (obrien@FreeBSD.org)
On Thu, Mar 31, 2005 at 08:14:45AM +0930, Greg 'groggy' Lehey wrote:> > I had 4 bad out of 12 tested where the DIMMs were Crucial PC2700 2GB> > Reg. ECC DIMMs.>
OK, this makes sense. It might also explain why the 4 GB> configuration only recognizes 3.5 GB.
No. This is due to the 3.5-4.0GB PA address range that the PeeCee architecture reserves for the PCI config space, AGP GART, memory mapped I/O, etc... Many Opteron BIOS's don't bother to hoist the "covered" memory above 4GB.
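As a quick arithmetic check, the sketch below takes the "real memory" figure from the dmesg posted at the start of the thread and computes how much address space lies between it and the 4 GB boundary. The function name is made up; treating the whole gap as the reserved range is a simplifying assumption.

```python
# Sketch: size of the address range between the reported limit and 4 GB.
# REAL_MEMORY is the "real memory" value from the posted dmesg; attributing the
# entire gap to the reserved PCI/AGP/MMIO range is a simplifying ASSUMPTION.
MB = 1 << 20
GB = 1 << 30

REAL_MEMORY = 3756916736


def gap_below_4g(real_memory_bytes):
    """Bytes of physical address space between the reported limit and 4 GB."""
    return 4 * GB - real_memory_bytes


if __name__ == "__main__":
    print("reported:", REAL_MEMORY // MB, "MB")
    print("gap below 4 GB:", gap_below_4g(REAL_MEMORY) // MB, "MB")
```

The gap comes out at roughly half a gigabyte, which matches the 3.5-4.0 GB range described above.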
Please see the freebsd-amd64 archives -- this has been discussed many times.
-- -- David (obrien@FreeBSD.org)
On Thu, Mar 31, 2005 at 10:32:33AM +0930, Daniel O'Connor wrote:> On Thu, 31 Mar 2005 08:14, Greg 'groggy' Lehey wrote:> > > Have you run sysutils/memtest86 with the 8 GB?> >
On Thu, Mar 31, 2005 at 11:24:29AM +0930, Greg 'groggy' Lehey wrote:> On Thursday, 31 March 2005 at 10:32:33 +0930, Daniel O'Connor wrote:> > On Thu, 31 Mar 2005 08:14, Greg 'groggy' Lehey wrote:> >>> Have you run sysutils/memtest86 with the 8 GB?> >>
I'm pretty sure it's not the memory. I've tried each pair> individually, and it's only when they're both in there together that> it's a problem. And yes, I've tried them in each pair of slots.
You have a dual-channel memory controller. If you insert one DIMM you perform 64-bit data accesses. If you install DIMMs in pairs (making sure you're using the right "paired" sockets), you perform 128-bit data accesses. Thus your access pattern is different between these two situations. I'm highly suspicious that you can use 4x2GB DIMMs, without knowing the exact part number. Don't forget that 2GB DIMMs are double-stacked and thus look like double the electrical bus load. The same is true for older 1GB DIMMs.
Install all the memory you would like to use into your motherboard, download memtest86+ version 1.40 from http://www.memtest.org, dd to floppy or burn the ISO, and report back your findings from running it.
Also what version of the BIOS are you using?
-- -- David (obrien@FreeBSD.org)
Date: Thu, 31 Mar 2005 16:12:35 +0900> From: Ganbold <ganbold@micom.mng.net>> Subject: Re: Problems with AMD64 and 8 GB RAM?>
Hi,>
Since we are discussing AMD64 with 8GB RAM, I also would like to point my> problem.>
I'm still looking for possibility to run FreeBSD 5.3-STABLE with more than> 4GB RAM> on Dual amd64 2.2GHz machine (IBM @server 325) with ServeRAID 6M (ips> driver)).> Right now I'm using only 4GB RAM and this server is in production.>
As Scott said a few months ago, problem is below:>
"The ips driver looks like it will fail under heavy load when more than 4GB> of RAM is present. It tries to force busdma to not defer requests when the> bounce page reserve is low, but that looks to be broken and> will result in corrupted commands."
[Alan Jay] Since we are talking about FreeBSD on AMD64: I have already reported issues on the AMD64 list.
I have a Tyan Thunder K8S Pro S2882 with twin Opterons and 8 GB of RAM. Although I can get the machine to run reasonably stably with 8 GB of RAM under limited load, when pushed it falls over unpredictably.
We did some tests with the latest 5.3-STABLE / 5.4-PRERELEASE and still found the same issues when using a MySQL database heavily hit over the Ethernet controller. Our final tests limited the memory at boot to 4 GB and the bug is still there, so we think it may well be some interaction with the Ethernet controller. The motherboard we have has a Broadcom BCM5704C 10/100/1000 based card on board.
Again, this works fine initially, but then we get a very dramatic failure with no warning messages and the system falls over.
There are still a few issues to be ironed out with FreeBSD 5.x on AMD64. The latest STABLE/PRERELEASE is much improved, but be aware that there may be issues. We will wait a few more weeks before retrying these tests to see whether the latest fixes that have been discussed have solved our problems.
We did some tests with the latest 5.3-STABLE / 5.4-PRERELEASE and > still found> the same issues when using a mySQL database heavily hit over the > Ethernet> controller. Our final tests limited the memory on boot-up to 4Gb and > the bug> is still there so we think it may well be some interaction with the > Ethernet> controller. The motherboard we have has a BroadcomBCM5704C > 10/100/1000 based> card on board.>
I have seen similar behaviour with a Postgres 8.0 database. Occasionally I'll see a bge0 timeout + reset error logged, but many times I'll just see a "socket closed unexpectedly" type of message from postgres. So far, every 5 days or so, the machine freezes during heavy DB reporting over the net. I have a S2881 mobo, though, with 4GB.
I had another identical machine which was reporting in the BIOS that the memory size changed during normal operations, which was very scary...
Willem Jan Withagen 5 April 2005 11:33:24
Greg 'groggy' Lehey wrote:> I've recently acquired an AMD64 box (dual Opteron 242, SiS Master@-FAR> motherboard> (http://www.msi.com.tw/program/products/server/svr/pro_svr_detail.php?UID=484).> See below for more details). I find it very unstable running with 8> GB memory, though 4 GB are not a problem. At first I thought it was> the onboard peripherals, but after disabling them it still persisted.>
What's unstable? I only once got it through the boot process.> Running a 5.3-RELEASE i386 kernel it panics, though I haven't> investigated the panic (yet), since I'm not interested in the i386> kernel. The amd64 5.4-PRERELEASE kernel just hangs/freezes. When the> peripherals are enabled, it's after probing the onboard NIC (bge) and> before probing SATA (no drives present). I've done a verbose boot, of> course, but no additional information is present. The NIC is> recognized, and that's all.>
Without the peripherals, but with a 3Com 3c905 PCI NIC, it continues> beyond this point, but doesn't enable the NIC. I don't have dmesg> output for these attempts, so I can't produce the exact message, and I> suspect it's not important. It continues until trying to mount NFS> file systems, where it hangs for obvious reasons. Pressing ^C causes> the system to either panic (and be unable to dump because I don't have> that much swap) or just hang.>
None of these problems occur when I use 4 GB memory. About the only> strangeness, which seems to come from the BIOS, is that it recognizes> only 3.5 GB. If I put all DIMMS in, it recognizes the full 8 GB> memory.>
I realize that this isn't enough to diagnose the problem. The reason> for this message now is to ask:>
1. Has anybody else seen this problem?
Hi Greg,
[Currently little time so I'll dig the archives later for more details]
I'm sorry to come into this discussion after 58 messages, but this board was discussed extensively about a year ago, because it gave me no end of trouble (even with 2 GB). One of the early amd64 developers (not David or Scott) had the same board but could not get it stable under amd64 (i386 was fine with 2 GB). He tossed it and suggested that I do the same, which I did, and went to a Tyan board S278. After that there were no more problems at all. At the time I think things were at 5.1, so now with 5.3 some features might have made the board act more stable.
Willem Jan Withagen 5 April 2005 15:24:50
Greg 'groggy' Lehey wrote:> I've recently acquired an AMD64 box (dual Opteron 242, SiS Master@-FAR> motherboard> (http://www.msi.com.tw/program/products/server/svr/pro_svr_detail.php?UID=484).> See below for more details). I find it very unstable running with 8> GB memory, though 4 GB are not a problem. At first I thought it was> the onboard peripherals, but after disabling them it still persisted.>
What's unstable? I only once got it through the boot process.> Running a 5.3-RELEASE i386 kernel it panics, though I haven't
1. Has anybody else seen this problem?> 2. Has anybody else used this hardware configuration and *not* seen> this problem?
[Posted something like this earlier, but did not see it on the list]
A little late to the discussion, but nonetheless.
I bought this board over a year ago to run amd64 (you're running i386). But in the end I gave up on it for running amd64, since one of the developers involved at the time also tried to use the board without much success. It was running fine with amd64, 1 CPU, 2 GB, but as soon as I added the 2nd CPU the slightest load crashed the system somewhere in the IPI areas. After long discussions I came to the conclusion that it was easier to get a new motherboard, so I got a Tyan Tiger S275, which has not yet failed on me.
So if you ever want to run amd64 on this system, even now that you've determined the problem to be the load on the memory bus, be warned that odd things could happen.