Storage Performance 2013 Joe Chang www.qdpma.com #SQLSatRiyadh

About Joe: SQL Server consultant since 1999. Query Optimizer execution plan cost formulas (2002); true cost structure of SQL plan operations (2003?); database with distribution statistics only, no data (2004); decoding statblob/stats stream – writing your own statistics; disk IO cost structure. See the ExecStats tool for system monitoring and execution plan analysis: www.qdpma.com

Storage Performance Chain: SQL Server Engine – Extent – File – Direct-attach/SAN Pool – SAS/FC – RAID Group – SAS HDD/SSD. All elements must be correct – no weak links. Perfect on 6 out of 7 elements with 1 not correct = bad IO performance.

Storage Performance Overview. System architecture – PCI-E, SAS, HBA/RAID controllers. SSD, NAND, flash controllers, standards – form factors, endurance, ONFI, interfaces, SLC vs. MLC, performance. Storage system architecture – direct-attach, SAN. Database – SQL Server files, filegroups.

Sandy Bridge EN & EP
EN – Xeon E5-2400, Socket B2 (1356 pins): 1 QPI 8 GT/s, 3 DDR3 memory channels, 24 PCI-E 3.0 lanes 8GT/s, DMI2 (x4 @ 5GT/s).
E5-2470 8 core 2.3GHz 20M 8.0GT/s (3.1); E5-2440 6 core 2.4GHz 15M 7.2GT/s (2.9); E5-2407 4c-4t 2.2GHz 10M 6.4GT/s (n/a).
EP – Xeon E5-2600, Socket R (2011 pins): 2 QPI, 4 DDR3, 40 PCI-E 3.0 lanes 8GT/s, DMI2. Two sockets: 80 PCI-E gen 3 lanes, 8 gen 2 possible.
Slot layouts: Dell T620 4 x16, 2 x8, 1 x4; Dell R720 1 x16, 6 x8; HP DL380 G8p 2 x16, 3 x8, 1 x4; Supermicro X9DRX F 10 x8, 1 x4 g2.
Model, cores, clock, LLC, QPI, (Turbo): E5-2690 8 core 2.9GHz 20M 8.0GT/s (3.8)*; E5-2680 8 core 2.7GHz 20M 8.0GT/s (3.5); E5-2670 8 core 2.6GHz 20M 8.0GT/s (3.3); E5-2667 6 core 2.9GHz 15M 8.0GT/s (3.5)*; E5-2665 8 core 2.4GHz 20M 8.0GT/s (3.1); E5-2660 8 core 2.2GHz 20M 8.0GT/s (3.0); E5-2650 8 core 2.0GHz 20M 8.0GT/s (2.8); E5-2643 4 core 3.3GHz 10M 8.0GT/s (3.5)*; E5-2640 6 core 2.5GHz 15M 7.2GT/s (3.0).
Disable cores in BIOS/UEFI?

Xeon E5-4600 – Socket R (2011 pins): 2 QPI, 4 DDR3, 40 PCI-E 3.0 lanes 8GT/s, DMI2 per socket. Four sockets: 160 PCI-E gen 3 lanes, 16 gen 2 possible.
Slot layouts: Dell R820 2 x16, 4 x8, 1 internal; HP DL560 G8p 2 x16, 3 x8, 1 x4; Supermicro X9QR 7 x16, 1 x8.
Model, cores, clock, LLC, QPI, (Turbo): E5-4650 8 core 2.70GHz 20M 8.0GT/s (3.3)*; E5-4640 8 core 2.40GHz 20M 8.0GT/s (2.8); E5-4620 8 core 2.20GHz 16M 7.2GT/s (2.6); E5-4617 6c-6t 2.90GHz 15M 7.2GT/s (3.4); E5-4610 6 core 2.40GHz 15M 7.2GT/s (2.9); E5-4607 6 core 2.20GHz 12M 6.4GT/s (n/a); E5-4603 4 core 2.00GHz 10M 6.4GT/s (n/a).
The high-frequency 6-core gives up HT; there is no high-frequency 4-core.

1 PCI-E, SAS & RAID CONTROLLERS

PCI-E gen 1, 2 & 3
Gen     Raw bit rate  Unencoded  BW per direction (x1)  BW per direction (x8)  Net BW per direction (x8)
PCIe 1  2.5GT/s       2Gbps      250MB/s                2GB/s                  1.6GB/s
PCIe 2  5.0GT/s       4Gbps      500MB/s                4GB/s                  3.2GB/s
PCIe 3  8.0GT/s       8Gbps      1GB/s                  8GB/s                  6.4GB/s?
PCIe 1.0 & 2.0 encoding scheme: 8b/10b. PCIe 3.0 encoding scheme: 128b/130b. Simultaneous bi-directional transfer. Protocol overhead – sequence/CRC, header – 22 bytes (20%?). Adaptec Series 7: 6.6GB/s, 450K IOPS.
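The figures above can be reproduced with a small calculation: raw transfer rate, times encoding efficiency (8b/10b or 128b/130b), times lane count, minus the roughly 20% packet/protocol overhead assumed here. A minimal Python sketch (the 20% overhead is this slide's approximation, not a spec value):

```python
# Rough PCIe bandwidth estimate: raw rate x encoding efficiency x lanes,
# then subtract an assumed ~20% packet/protocol overhead (headers, CRC, acks).

def pcie_net_bandwidth(gt_per_s, enc_data, enc_total, lanes=8, protocol_overhead=0.20):
    """Return (unencoded GB/s, net GB/s) per direction for a PCIe link."""
    unencoded = gt_per_s * enc_data / enc_total * lanes / 8.0   # GB/s per direction
    return unencoded, unencoded * (1.0 - protocol_overhead)

for gen, rate, d, t in [("PCIe 1.x", 2.5, 8, 10),
                        ("PCIe 2.0", 5.0, 8, 10),
                        ("PCIe 3.0", 8.0, 128, 130)]:
    raw, net = pcie_net_bandwidth(rate, d, t)
    print(f"{gen} x8: {raw:.1f} GB/s raw, ~{net:.1f} GB/s realizable")
```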

PCI-E Packet: net realizable bandwidth appears to be about 20% less (1.6GB/s out of 2.0GB/s).

PCIe Gen2 & SAS/SATA 6Gbps. SATA 6Gbps – single lane, net BW 560MB/s. SAS 6Gbps x4 – 4 lanes, net BW 2.2GB/s – dual-port is SAS protocol only, not supported by SATA. Example: a PCIe g2 x8 HBA (3.2GB/s) with two SAS x4 6G ports (2.2GB/s each). Some bandwidth mismatch is OK, especially on the downstream side.

PCIe 3 & SAS 12Gbps – coming soon? Slowly? The infrastructure will take more time. A PCIe 3.0 x8 HBA can drive 2 SAS x4 12Gbps ports, or 4 SAS x4 6Gbps ports (through 6Gbps SAS expanders), if the HBA can support 6GB/s.

PCIe Gen3 & SAS 6Gbps

LSI 12Gbps SAS 3008

PCIe RAID Controllers? 1st generation: 2 x4 SAS 6Gbps ports (2.2GB/s per x4 port) on PCIe 2 – 2.8GB/s? Adaptec: PCIe g3 can do 4GB/s, so 3 x4 SAS 6Gbps ports are a bandwidth match. PCIe 3.0 x8 with 6 x4 SAS 6Gbps ports – Adaptec Series 7, PMC – one chip with x8 PCIe g3 and 24 SAS 6Gbps lanes, because they could.
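A quick way to sanity-check whether such a controller is host-side or drive-side limited, using the rough figures above (2.2 GB/s per x4 SAS 6Gbps port, ~6.4 GB/s net for a PCIe 3.0 x8 slot); a sketch, not vendor data:

```python
# Check whether a RAID/HBA controller's host-side (PCIe) bandwidth is a
# reasonable match for its drive-side SAS ports (slide's approximate numbers).

PCIE_G3_X8_NET = 6.4        # GB/s, approximate realizable PCIe 3.0 x8 bandwidth
SAS_X4_6G_NET  = 2.2        # GB/s per x4 6Gbps SAS port

def match(sas_ports):
    drive_side = sas_ports * SAS_X4_6G_NET
    usable = min(drive_side, PCIE_G3_X8_NET)
    print(f"{sas_ports} x4 SAS ports: drive side {drive_side:.1f} GB/s, "
          f"host side {PCIE_G3_X8_NET:.1f} GB/s -> ~{usable:.1f} GB/s usable")

for ports in (2, 3, 6):     # 2-port 1st-gen cards, 3-port balanced, 6-port Series 7 style
    match(ports)
```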

2 SSD, NAND, FLASH CONTROLLERS

SSD Evolution. HDD replacement – using existing HDD infrastructure; the PCI-E card form factor lacks expansion flexibility. Storage system designed around SSD – PCI-E interface with an HDD-like form factor? A storage enclosure designed for SSD. Rethink computer system memory & storage – and re-do the software stack too!

SFF-8639 & Express Bay. SCSI Express – storage over PCI-E; NVMe.

New Form Factors - NGFF. Enterprise 10K/15K HDD is 15mm; an SSD storage enclosure could be 1U with 75 x 5mm devices?

SATA Express Card (NGFF) Crucial mSATA M2

SSD – NAND Flash. NAND – SLC, MLC regular and high-endurance; eMLC could mean endurance MLC or embedded MLC, which differ. The controller interfaces NAND to SATA or PCI-E. Form factor – SATA/SAS interface in a 2.5in HDD or a new form factor; PCI-E interface and form factor, or HDD-like form factor; or a complete SSD storage system.

NAND Endurance Intel – High Endurance Technology MLC

NAND Endurance – Write Performance. Endurance and write performance rank: SLC, then MLC-e, then MLC. Cost structure: MLC 1, MLC EE 1.3, SLC 3. Process dependent: 34nm, 25nm, 20nm. Write perf?

NAND P/E - Micron. 34 or 25nm MLC NAND is probably good; a database can support the cost structure.

NAND P/E - IBM. 34 or 25nm MLC NAND is probably good; a database can support the cost structure.

Write Endurance. Vendors commonly cite a single spec for a range of models (120, 240, 480GB). Should it vary with raw capacity? Depends on overprovisioning? A 3-year life is OK for the MLC cost structure, maybe even 2 years. MLC at 20TB lifetime: 10GB/day for 2000 days (~5 years), or 20GB/day for ~3 years. Vendors now cite 72TB write endurance for 120-480GB capacities?
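Turning a TBW rating into a lifetime (or the reverse) is just total bytes written divided by daily write volume; a small sketch using the 72 TB and 20 GB/day figures from this slide:

```python
# Translate a vendor TBW (total bytes written) rating into drive life at a
# given daily write rate, and back. Figures below come from the slide.

def years_of_life(tbw_terabytes, gb_per_day):
    return tbw_terabytes * 1024.0 / gb_per_day / 365.0

def required_tbw(gb_per_day, years):
    return gb_per_day * 365.0 * years / 1024.0

print(f"72 TB TBW at 20 GB/day : {years_of_life(72, 20):.1f} years")
print(f"20 GB/day for 3 years needs ~{required_tbw(20, 3):.0f} TB TBW")
```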

NAND: SLC – fast writes, high endurance; eMLC – slow writes, medium endurance; MLC – medium writes, low endurance. MLC cost structure of about $1/GB @ 25nm – eMLC 1.4X, SLC 2X?

ONFI – Open NAND Flash Interface organization: 1.0 (2006) – 50MB/s; 2.0 (2008) – 133MB/s; 2.1 (2009) – 166 & 200MB/s; 3.0 (2011) – 400MB/s, and Micron has 200 & 333MHz products. ONFI 1.0 – 6 channels to support 3Gbps SATA (260MB/s). ONFI 2.0 – 4 channels to support 6Gbps SATA (560MB/s).
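The channel counts quoted here fall out of dividing each interface's net bandwidth by the per-channel ONFI rate; a sketch with the approximate numbers used in this deck:

```python
# How many NAND channels are needed to keep a host interface busy, given the
# per-channel rate of each ONFI generation. Interface net bandwidths are the
# deck's rough figures, not spec maxima.
import math

onfi_mb_per_channel = {"ONFI 1.0": 50, "ONFI 2.0": 133, "ONFI 2.1": 200, "ONFI 3.0": 400}
interfaces = {"SATA 3Gbps": 260, "SATA 6Gbps": 560, "SAS x4 6Gbps": 2200, "PCIe 3.0 x8": 6400}

for onfi, ch_rate in onfi_mb_per_channel.items():
    for iface, net in interfaces.items():
        channels = math.ceil(net / ch_rate)
        print(f"{onfi} ({ch_rate} MB/s/ch) -> {iface} ({net} MB/s): ~{channels} channels")
```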

NAND write performance MLC 85MB/s per 4-die channel (128GB) 340MB/s over 4 channels (512GB)?

Controller Interface: PCIe vs. SATA. Some bandwidth mismatch/overkill is OK. ONFI 2 – 8 channels at 133MHz to SATA 6Gbps (560 MB/s) is a good match. Controller to PCIe or SATA? Multiple lanes? But ONFI 3.0 overwhelms SATA 6Gbps: 6-8 channels at 400MB/s to match a 2.2GB/s x4 SAS port, 16 channels at 400MB/s to match 6.4GB/s x8 PCIe 3. CPU access efficiency and scaling: Intel & NVM Express.

Controller Interface: PCIe vs. SATA – in both cases DRAM + controller + NAND; the difference is the host interface and channel count.
PCIe NAND controller vendors:
Vendor     Channels  PCIe
IDT        32        x8 Gen3, NVMe
Micron     32        x8 Gen2
Fusion-IO  3 x4?     x8 Gen2?

SATA & PCI-E SSD Capacities. 64 Gbit MLC NAND die, 150mm², 25nm. 8 x 64 Gbit die in 1 package = 64GB. SATA controller – 8 channels, 8 packages x 64GB = 512GB. PCI-E controller – 32 channels x 64GB = 2TB. (Die generations: 2 x 32 Gbit at 34nm, 1 x 64 Gbit at 25nm, 1 x 64 Gbit at 20nm.)
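The capacity arithmetic is die size x dies per package x packages (one package per channel here); a sketch of the 8-channel SATA and 32-channel PCI-E cases from this slide:

```python
# Capacity arithmetic: a 64 Gbit (8 GB) MLC die, stacked 8 die per package,
# times the number of packages/channels the controller has.

DIE_GBIT = 64
DIE_GB = DIE_GBIT / 8              # 8 GB per die
DIES_PER_PACKAGE = 8               # 8 x 64 Gbit die in one package = 64 GB

def raw_capacity_gb(channels, packages_per_channel=1):
    return channels * packages_per_channel * DIES_PER_PACKAGE * DIE_GB

print(f"SATA controller, 8 channels : {raw_capacity_gb(8):.0f} GB raw")        # 512 GB
print(f"PCIe controller, 32 channels: {raw_capacity_gb(32)/1024:.0f} TB raw")  # 2 TB
```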

PCI-E vs. SATA/SAS. SATA/SAS controllers have 8 NAND channels – no economic benefit in fewer channels? 8 channels are a good match for 50MB/s NAND to SATA 3Gbps (approx 280MB/s realizable BW), and also a good match for 100MB/s NAND to SATA 6Gbps (560MB/s realizable BW) – but NAND is now at 200 & 333MB/s. PCI-E – 32 channels practical (about 1500 pins); 333MHz is a good match to gen 3 x8 at 6.4GB/s BW.

Crucial/Micron P400m & P400e (preliminary – need to update; sequential figures are "up to")
Crucial P400m (2M-hr MTBF)
Capacity  Raw    Seq Read  Seq Write  Random Read  Random Write  Endurance  Price
100GB     168GB  380MB/s   200MB/s    52K          21K           1.75PB     $300?
200GB     336GB  380MB/s   310MB/s    54K          26K           3.0PB      $600?
400GB     672GB  380MB/s   310MB/s    60K          26K           7.0PB      $1000?
Crucial P400e (1.2M-hr MTBF)
Capacity  Raw    Seq Read  Seq Write  Random Read  Random Write  Endurance  Price
100GB     128GB  350MB/s   140MB/s    50K          7.5K          175TB      $176
200GB     256GB  350MB/s   140MB/s    50K          7.5K          175TB      $334
400GB     512GB  350MB/s   140MB/s    50K          7.5K          175TB      $631
P410m (SAS) specs are slightly different. EE MLC: higher endurance, with write perf not lower than MLC?

Crucial m4 & m500 (preliminary – need to update; sequential figures are "up to")
Crucial m500 (1.2M-hr MTBF)
Capacity  Raw     Seq Read  Seq Write  Random Read  Random Write  Endurance  Price
120GB     128GB   500MB/s   130MB/s    62K          35K           72TB       $130
240GB     256GB   500MB/s   250MB/s    72K          60K           72TB       $220
480GB     512GB   500MB/s   400MB/s    80K          80K           72TB       $400
960GB     1024GB  -         -          -            -             -          $600
Crucial m4
Capacity  Raw     Seq Read  Seq Write  Random Read  Random Write  Endurance  Price
128GB     128GB   415MB/s   175MB/s    40K          35K           72TB       $112
256GB     256GB   415MB/s   260MB/s    40K          50K           72TB       $212
512GB     512GB   415MB/s   260MB/s    40K          50K           72TB       $400

Micron & Intel SSD Pricing (2013-02)
[Price chart, $0-1,000, for m400, P400e, P400m and S3700 at the 100/128, 200/256 and 400/512GB capacity points; chart note: need corrected P400m pricing.]
P400m raw capacities are 168, 336 and 672GB (pricing retracted). Intel SSD DC S3700 pricing: 235, 470, 940 and 1880 (800GB) respectively.

4K Write K IOPS
[4K write IOPS chart, 0-60K, for m400, P400e, P400m and S3700 at the 100/128, 200/256 and 400/512GB capacity points.]
P400m raw capacities are 168, 336 and 672GB (pricing retracted). Intel SSD DC S3700 pricing: 235, 470, 940 and 1880 (800GB) respectively.

SSD Summary. MLC is possible with a careful write strategy – partitioning to minimize index rebuilds, avoid full database restores to SSD. Endurance (HET) MLC – write perf? Standard DB practices work, but avoid frequent index defrags. SLC – only for the extreme write intensive? Lower volume product – higher cost.

3 DIRECT ATTACH STORAGE

Full IO Bandwidth. 10 PCIe g3 x8 slots are possible – Supermicro only; HP and Dell systems have 5-7 x8 slots plus 1 x4? 4GB/s per slot with 2 x4 SAS, 6GB/s with 4 x4. Mixed SSD and HDD – reduce wear on MLC. Misc devices on 2 x4 PCIe g2: internal boot disks, 1GbE or 10GbE, graphics.

System Storage Strategy. Dell & HP only have 5-7 slots; 4 controllers @ 4GB/s each is probably good enough? Few practical products can use PCIe G3 x16 slots. Capable of 16GB/s with the initial capacity – 4 HBAs at 4-6GB/s each, with allowance for capacity growth – and mixed SSD/HDD.

Clustered SAS Storage. Dell MD3220 supports clustering – up to 4 nodes without an external switch (extra nodes not shown). Each node (192 GB) has 2 HBAs into the MD3220 enclosures; each enclosure has dual controllers with 4 SAS host ports, an IOC with 2GB cache, a PCIe switch, and a SAS expander to the SSD/HDD bays.

Alternate SSD/HDD Strategy. Primary system – all SSD for data & tempdb; logs may be on HDD. Secondary (backup) system – HDD for backup and restore testing.

System Storage – Mixed SSD and HDD. Each RAID group/volume should not exceed the 2GB/s BW of x4 SAS; 2-4 volumes per x8 PCIe G3 slot. SATA SSD reads at 350-500MB/s and writes at 140MB/s; 8 per volume allows for some overkill. 16 SSD per RAID controller; 64 SATA/SAS SSDs to deliver 16-24GB/s. The 4-HDD-per-volume rule does not apply; HDD is for local database backup, restore tests, and DW flat files. SSD & HDD on a shared channel – simultaneous bi-directional IO.
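The 16-24 GB/s figure follows from capping each volume at the x4 SAS limit and each controller at the 4-6 GB/s assumed earlier; a sketch of that sizing arithmetic (rates are the deck's rough figures):

```python
# Per-volume bandwidth is the lesser of (SSDs per volume x per-SSD read rate) and
# the x4 SAS limit; per-controller bandwidth is then capped by the controller and slot.

SAS_X4_GBPS   = 2.2     # GB/s per x4 6Gbps SAS port
PCIE3_X8_GBPS = 6.4     # GB/s realizable per x8 PCIe 3.0 slot
SSD_READ_GBPS = 0.40    # ~400 MB/s per SATA SSD (slide says 350-500 MB/s)

def system_bandwidth(controllers=4, volumes_per_controller=4, ssd_per_volume=8,
                     controller_cap=6.0):
    per_volume = min(ssd_per_volume * SSD_READ_GBPS, SAS_X4_GBPS)
    per_controller = min(volumes_per_controller * per_volume, controller_cap, PCIE3_X8_GBPS)
    return controllers * per_controller

for cap in (4.0, 6.0):   # deck assumes 4-6 GB/s realizable per RAID controller
    print(f"4 controllers capped at {cap} GB/s each: ~{system_bandwidth(controller_cap=cap):.0f} GB/s")
```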

SSD/HDD System Strategy. MLC is possible with a careful write strategy – partitioning to minimize index rebuilds, avoid full database restores to SSD. Hybrid SSD/HDD system with full-duplex signalling. Endurance (HET) MLC – write perf? Standard DB practices work; avoid index defrags. SLC – only for the extreme write intensive? Lower volume product – higher cost. HDD – for restore testing.

SAS Expander: 2 x4 to hosts, 1 x4 for expansion, 24 x1 for disks. Disk enclosure expansion ports not shown.

Storage Infrastructure – designed for HDD (15mm bays, 2U). 2 SAS expanders for dual-port support – 1 x4 upstream (to host), 1 x4 downstream (expansion), 24 x1 for bays.

Mixed HDD/SSD Enclosure (2U). Current: 24 x 15mm bays (360mm). Proposed: 16 x 15mm bays (240mm) plus 16 x 7mm bays (120mm).

Enclosure, 24 x 15mm and proposed. Current 2U enclosure, 24 x 15mm bays – HDD or SSD: 2 SAS expanders, 32 lanes each – 4 lanes upstream to host (x4 6Gbps, 2.2GB/s), 4 lanes downstream for expansion, 24 lanes for bays; 2 RAID groups for SSD, 2 for HDD; 1 SSD volume on path A, 1 SSD volume on path B. New SAS 12Gbps (x4, 4GB/s): 16 x 15mm plus 16 x 7mm bays, 2 SAS expanders, 40 lanes each – 4 lanes upstream to host, 4 lanes downstream for expansion, 32 lanes for bays.

Alternative Expansion. One PCIe x8 HBA fans out over SAS x4 links to four enclosures, each with its own expander. Each SAS expander – 40 lanes: 8 lanes upstream to host with no expansion, or 4 lanes upstream and 4 lanes downstream for expansion, plus 32 lanes for bays.

PCI-E with Expansion. A PCI-E switch behind the HBA feeds SAS expanders (x4 6Gbps, 2.2GB/s links). Express bay form factor? Few x8 ports or many x4 ports? PCI-E slot SSDs are suitable for a known capacity. 48- and 64-lane PCI-E switches are available – x8 or x4 ports.

Enclosure for SSD (+ HDD?). 2 x4 upstream on each expander – 4GB/s – no downstream ports for expansion? 32 ports for device bays – 16 SSD (7mm) and 16 HDD (15mm). 40 lanes total with no expansion, 48 lanes with expansion.

Large SSD Array. Large number of devices, large capacity – downstream from the CPU there is excess bandwidth. We do not need SSD firmware peak performance – 1) no stoppages, 2) consistency is nice. Mostly static data with some write-intensive portions – careful use of partitioning to avoid index rebuilds and defragmentation – e.g., if 70% is static and 10% is write intensive. Does wear leveling work?

4 DATABASE – SQL SERVER

Database Environment (OLTP, DW). Databases are very high value – the software license and development cost is huge – 1 or more full-time DBAs, several application developers, and help desk personnel – so any reasonable expense can be justified. Full knowledge of the data (where the writes are), full control of the data (where the writes are) – practices can be adjusted to avoid writes to SSD.

Database – Storage Growth. 10GB per day data growth for a big company – 10M items at 1KB per row (or 4 x 250 byte rows) – about 18 TB over 5 years (1831 days); the database log can stay on HDD. Heavy system – 64-128 x 256/512GB (raw) SSD; each SSD can support 20GB/day (36TB lifetime?). With partitioning – few full index rebuilds. Can replace MLC SSDs every 2 years if required.
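Worked arithmetic for the growth estimate (decimal GB/TB; the row size, day count, and SSD counts are this slide's assumptions):

```python
# 10M rows/day at ~1 KB/row is ~10 GB/day; over 1831 days that is ~18 TB,
# and spread over 64-128 SSDs the per-device write rate is small.

rows_per_day  = 10_000_000
bytes_per_row = 1024
days          = 1831                      # ~5 years

daily_gb = rows_per_day * bytes_per_row / 1e9
total_tb = daily_gb * days / 1000
print(f"{daily_gb:.1f} GB/day -> {total_tb:.1f} TB over {days} days")

for ssd_count in (64, 128):
    print(f"spread over {ssd_count} SSDs: ~{daily_gb / ssd_count * 1000:.0f} MB/day per device")
```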

Extra Capacity - Maintenance. Storage capacity will be 2-3X the database size – it would be really stupid if you could not update the application for lack of space to modify a large table. In a SAN environment, only the required storage capacity may be allocated, and you may not be able to perform maintenance operations if the SAN admin does not allocate extra space.

SSD/HDD Component Pricing 2013: MLC consumer ~$1.0K/TB; MLC Micron P400e ~$1.2K/TB; MLC endurance ~$2.0K/TB; SLC ~$4K/TB? HDD 600GB 10K ~$400.

Database Storage Cost. 8 x 256GB (raw) SSD per x4 SAS channel = 2TB; 2 x4 ports per RAID controller = 4TB/RC; 4 RAID controllers per 2-socket system = 16TB (32TB with 512GB SSDs, 64TB with 1TB). 64 SSD per system at $250 (MLC) = $16K; 64 HDD (10K, 600GB) at $400 = $26K; server (2xE5, 24x16GB), qty 2 at $12K each; SQL Server 2012 EE at $6K x 16 cores = $96K. A HET MLC or even SLC premium is OK. Server/enterprise premium – high validation effort, low volume, high support expectations.
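A sketch of the cost roll-up, using the rough 2013 street prices assumed on this slide:

```python
# Rough system cost roll-up (slide's approximate 2013 prices, illustrative only).
items = {
    "64 x 256GB MLC SSD @ $250":               64 * 250,
    "64 x 600GB 10K HDD @ $400":               64 * 400,
    "2 x server (2xE5, 24x16GB) @ $12K":        2 * 12_000,
    "SQL Server 2012 EE, 16 cores @ $6K/core": 16 * 6_000,
}
for name, cost in items.items():
    print(f"{name:45s} ${cost:>9,}")
print(f"{'Total':45s} ${sum(items.values()):>9,}")
```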

OLTP & DW. OLTP – backup to local HDD: superfast backup, read 10GB/s, write 3GB/s (RAID 5); writes to data are blocked during backup; recovery requires log replay. DW – example: 10TB data, 16TB SSD; flat files on HDD; tempdb will generate intensive writes (1TB). Database (real) restore testing – force transaction roll forward/back, i.e., an HDD array is needed.
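A rough backup-window estimate from the 10 GB/s read / 3 GB/s write figures above (no compression assumed):

```python
# Rough backup window: the slower side (RAID-5 HDD write at ~3 GB/s) is the bottleneck.

def backup_hours(db_tb, read_gbps=10.0, write_gbps=3.0):
    seconds = db_tb * 1000 / min(read_gbps, write_gbps)
    return seconds / 3600

for db in (10, 16):
    print(f"{db} TB database: ~{backup_hours(db):.1f} h backup window")
```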

SQL Server Storage Configuration. The IO system must have massive IO bandwidth – IO over several channels. The database must be able to use all channels simultaneously – multiple files per filegroup. Volumes / RAID groups on each channel – each volume comprised of several devices.

HDD, RAID versus SQL Server. HDD – pure sequential IO is not practical and impossible to maintain; large blocks (256K) are good enough, 64K is OK. RAID controller – 64K to 256K stripe size. SQL Server – default extent allocation is 64K per file; with -E, 4 consecutive extents – why not 16?

File Layout – Physical View. Each filegroup and tempdb has 1 data file on every data volume; IO to any object is distributed over all paths and all disks.

Filegroup & File Layout
Disk 2 (Controller 1 Port 0): FileGroup A File 1, FileGroup B File 1, tempdb File 1
Disk 3 (Controller 1 Port 1): FileGroup A File 2, FileGroup B File 2, tempdb File 2
Disk 4 (Controller 2 Port 0): FileGroup A File 3, FileGroup B File 3, tempdb File 3
Disk 5 (Controller 2 Port 1): FileGroup A File 4, FileGroup B File 4, tempdb File 4
Disk 6 (Controller 3 Port 0): FileGroup A File 5, FileGroup B File 5, tempdb File 5
Disk 7 (Controller 3 Port 1): FileGroup A File 6, FileGroup B File 6, tempdb File 6
Disk 8 (Controller 4 Port 0): FileGroup A File 7, FileGroup B File 7, tempdb File 7
Disk 9 (Controller 4 Port 1): FileGroup A File 8, FileGroup B File 8, tempdb File 8
As shown, 2 RAID groups per controller, 1 per port; can be 4 RG/volumes per controller. OS and log disks not shown. Each filegroup has 1 file on each data volume; each object is distributed across all data "disks"; tempdb data files share the same volumes.
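A small sketch that generates this mapping – one file per filegroup (plus one tempdb file) on every data volume – so the layout can be scripted rather than hand-built; the controller/port names are placeholders:

```python
# Generate the file-to-volume mapping described above: every filegroup (and tempdb)
# gets one file on each data volume, so IO is striped across all controllers/ports.

volumes = [f"Controller {c} Port {p}" for c in range(1, 5) for p in range(2)]  # 8 data volumes
filegroups = ["FileGroup A", "FileGroup B", "tempdb"]

layout = {vol: [f"{fg}, File {i + 1}" for fg in filegroups]
          for i, vol in enumerate(volumes)}

for vol, files in layout.items():
    print(f"{vol}: " + "; ".join(files))
```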

RAID versus SQL Server Extents. Four volumes (Controller 1 Port 0/1, Controller 2 Port 0/1), extents 1-48 allocated round-robin by default: extent 1 from file 1, extent 2 from file 2, and so on. Disk IO is 64K, and only 1 disk in each RAID group is active at a time.

Consecutive Extents (-E). With -E, 4 consecutive extents are allocated from each file, so the OS issues 256K disk IO. Each HDD in the RAID group sees 64K IO, and up to 4 disks in the RAID group get IO.
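A small simulation of the two allocation patterns above – default round-robin (one 64K extent per file in turn) versus -E (four consecutive extents per file) – and the resulting contiguous IO size; a sketch, not the storage engine's actual code:

```python
# Simulate round-robin extent allocation across data files: 1 extent per turn
# (default) versus 4 consecutive extents per turn (-E), and show the IO size.

def allocate(num_files, num_extents, extents_per_turn):
    placement = {f: [] for f in range(1, num_files + 1)}
    f, placed = 1, 0
    while placed < num_extents:
        for _ in range(extents_per_turn):
            placed += 1
            placement[f].append(placed)
            if placed == num_extents:
                break
        f = f % num_files + 1            # next file, round robin
    return placement

for label, per_turn in (("default", 1), ("-E", 4)):
    p = allocate(num_files=4, num_extents=16, extents_per_turn=per_turn)
    print(f"{label}: file 1 gets extents {p[1]} -> contiguous run = {per_turn} x 64K = {per_turn * 64}K IO")
```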

Storage Summary. OLTP – endurance MLC or consumer MLC? DW – MLC with higher over-provisioning. QA – consumer MLC or endurance MLC? Tempdb – possibly SLC. Single log – HDD; multiple logs – SSD? Backups, test restores, flat files – HDD. No caching, no auto-tiers.

SAN

Software Cache Tier

Cache / Auto-Tier: a good idea if 1) no knowledge and 2) no control. In the database we have 1) full knowledge and 2) full control – virtual file stats, filegroups, partitioning.

Common SAN Vendor Configuration. Two nodes (768 GB each) through an 8Gbps FC or 10Gbps FCoE switch to SP A/SP B (24 GB cache each), then x4 SAS at 2GB/s to a main data volume, a log volume, SSD, 10K and 7.2K HDD tiers, and hot spares. Multi-path IO: preferred port / alternate port. A single large volume for data, with additional volumes for log, tempdb, etc. All data IO on a single FC port = 700MB/s IO bandwidth. Path and component fault tolerance, but poor IO performance.

Multiple Paths & Volumes. Multiple local SSDs for tempdb; multiple quad-port FC HBAs; many SAS ports on the SAN back end (SP A / SP B, 24 GB cache each, x4 SAS 2GB/s). Data volumes 1-16, optional SSD volumes 1-4, and log volumes 1-4 spread across both storage processors. Data files must also be evenly distributed.

8Gbps FC rules 4-5 HDD RAID Group/Volumes – SQL Server with –E only allocates 4 consecutive extents 2 Volumes per FC port – Target 700MB/s per 8Gbps FC port SSD Volumes – Limited by 700-800MB/s per 8Gbps FC port – Too many ports required for serious BW – Management headache from too many volumes
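The port-count pain is easy to quantify: divide the target bandwidth by the ~700 MB/s realizable per 8Gbps FC port cited above; a sketch with a few illustrative targets:

```python
# Number of 8Gbps FC ports needed for a target bandwidth at ~700 MB/s per port.
import math

FC8_NET_MB = 700

for target_gb in (2.8, 6.4, 10.0):   # e.g. one HBA pair, one PCIe 3 x8 slot, a DW target
    ports = math.ceil(target_gb * 1000 / FC8_NET_MB)
    print(f"{target_gb:>4} GB/s target: {ports} x 8Gbps FC ports")
```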

SQL Server. A table scan to a heap generates 512K IO – easy to hit 100MB/s per disk; to a (clustered) index, 64K IO – 30-50MB/s per disk likely.

EMC VNX 5300 FT DW Ref Arch

iSCSI & File Structure. Multiple x4 / 10GbE paths (RJ45 and SFP+) to two controllers. Option 1: Controller 1 holds the DB1 files and Controller 2 holds the DB2 files. Option 2: each database has a file on each controller (Controller 1: DB1 file 1, DB2 file 1; Controller 2: DB1 file 2, DB2 file 2).

EMC VMAX

EMC VMAX, original and 2nd gen (front end ports, CPU complex, global memory, back end ports):
Original: 2.3 GHz Xeon (Harpertown), 16 CPU cores, 128 GB cache memory (maximum), dual Virtual Matrix, PCIe Gen1.
2nd gen: 2.8 GHz Xeon w/turbo (Westmere), 24 CPU cores, 256 GB cache memory (maximum), quad Virtual Matrix, PCIe Gen2.

EMC VMAX 10K

EMC VMAX Virtual Matrix

VMAX Director

EMC VMAX Director.
VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8GHz per director, 50GB/s Virtual Matrix BW?, 16 x 8Gbps FC per engine.
VMAX 20K engine: 4 quad-core 2.33GHz, 128GB; Virtual Matrix BW 24GB/s. System: 8 engines, 1TB, VM BW 192GB/s, 128 FE ports.
VMAX 40K engine: 4 six-core 2.8GHz, 256GB; Virtual Matrix BW 50GB/s. System: 8 engines, 2TB, VM BW 400GB/s, 128 FE ports.
RapidIO IPC: 3.125GHz, 2.5Gb/s after 8b/10b, 4 lanes per connection = 10Gb/s = 1.25GB/s (2.5GB/s full duplex), 4 connections per engine = 10GB/s.
36 PCI-E lanes per IOH, 72 combined: 8 FE, 8 BE, 16 VMI gen 1 or 32 VMI gen 2.
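The RapidIO numbers on this slide can be checked with a few lines of arithmetic (3.125 GHz signalling, 8b/10b encoding, 4 lanes per connection, 4 connections per engine):

```python
# Virtual Matrix link arithmetic: RapidIO at 3.125 GHz with 8b/10b gives 2.5 Gbps
# per lane; 4 lanes per connection and 4 connections per engine give the per-engine BW.

signal_ghz     = 3.125
data_gbps_lane = signal_ghz * 8 / 10                     # 2.5 Gbps after 8b/10b
lanes_per_conn = 4
conn_gb_per_s  = data_gbps_lane * lanes_per_conn / 8     # 1.25 GB/s per direction
full_duplex    = conn_gb_per_s * 2                       # 2.5 GB/s both directions
per_engine     = full_duplex * 4                         # 4 connections per engine

print(f"{data_gbps_lane} Gbps/lane, {conn_gb_per_s} GB/s per connection per direction, "
      f"{full_duplex} GB/s full duplex, {per_engine} GB/s per engine")
```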

SQL Server Default Extent Allocation. Data file 1 gets extents 1, 5, 9, 13, ...; file 2 gets extents 2, 6, 10, 14, ...; file 3 gets extents 3, 7, 11, 15, ...; file 4 gets extents 4, 8, 12, 16, ... One extent is allocated per file in round-robin, with proportional fill. An EE/SE table scan tries to stay 1024 pages ahead? SQL can read 64 contiguous pages from 1 file. The storage engine reads index pages serially in key order. Partitioned table support for heap organization desired?

SAN. [Diagram: two SAN nodes (768 GB each) and multiple database hosts (192 GB, multiple HBAs) through dual 8Gb FC switches to SP A/SP B (24 GB cache each), x4 SAS 2GB/s back end; data volumes 1-16, a log volume, and SSD 1-8 plus a log SSD on 10K HDD RAID groups.]

Clustered SAS. [Diagram: two nodes (768 GB, QPI) plus database hosts (192 GB) with multiple HBAs and an IB/10GbE interconnect, SAS in/out daisy-chained through clustered SAS enclosures – each with dual controllers (4 SAS host ports, IOC with 2GB cache, PCIe switch, SAS expander) – to shared SSD and HDD RAID groups.]

Fusion-IO ioScale
