It’s nearly here… SharePoint 2010

April 9, 2010

In May this year, just after the SQL Server 2008 R2 release (which will be made at the SQL PASS Europe conference on April 21st), come the big guns: SharePoint 2010 and Office 14, I mean, the Office 2010 system.

Right, smack in the middle of EMC World 2010 in Boston, Mass.

We will be there with architectural guidance, performance and scalability findings, and a look at how the new built-in Search capability weighs up against the old built-in search capability in SharePoint 2007.  Come see us in the EMC Proven Solutions Pavilion.

Our next steps after that? Scale up and bring FAST Search Server for the enterprise into play.

One of the bigger challenges for us in the lab is to delve into testing the many functions and features available in the new Service Application (SA) model versus the old SSP model, but we’ll get them.  More to come on that as we progress.

The TechNet documentation build-up is going fast and furiously right now; a recent update was the SharePoint 2010 hardware and software requirements.

We’ll keep in touch with our progress.

James.


The iSCSI performance issue…

April 9, 2010

Just an update to the iSCSI performance issue.

Microsoft has released the KB article we worked on last week:

http://support.microsoft.com/kb/2020559

Due to the significant change required, the design change request that has been submitted will not make it into Windows 7, but it is currently being considered for the Windows 8 platform.


Best Practice for Hyper-V with iSCSI

March 10, 2010

Hello folks.

In EMC Proven Solution testing for a given use case, I have come across a serious issue with iSCSI responses, which inherently causes slow storage response times and very slow cluster polling and enumeration.

Test config:
Windows 2008 R2 with Hyper-V.  6-node Hyper-V cluster.  65 disks.  2x iSCSI NICs per node.

What I saw was the cluster being slow to bring online the Virtual Machine Configuration cluster resource of any VM that had a large number of disks configured.  These resources would time out as they passed their default pending and deadlock time-outs.

Firstly, when you online, refresh, or fail over a VM configuration, Hyper-V performs a sanity check to ensure the underlying components of the VM (network, storage, etc.) are available.  This means scanning all the disks.
So, for a VM with 25 disks (as in my case), the VM config took over 10 minutes to come online!

Why?  Well, working with Microsoft OEM Support, I was asked to try tightening TcpAckFrequency to 1, which makes the host acknowledge every received segment immediately instead of batching ACKs.  I said, OK, I’ll try it.
This brought the online time down from 10 minutes to 19 seconds!  Result… or maybe…

I needed to fully understand the issue, so out came Wireshark to run some Ethernet traces…

Let me explain what the actual issue is…

The problem is basically that iSCSI is a victim of Nagle’s algorithm, a TCP optimization that minimizes congestion due to protocol overhead by coalescing small sends.  iSCSI is essentially stung by this send coalescing.

From Windows 2003 onwards, the default TCP delayed-ACK time is 200ms.
This means that if a TCP segment (roughly 1,460 bytes of payload on standard Ethernet) is not full, the data may wait up to 200ms before it is actually sent from the Windows host.

This is a problem when using iSCSI for two reasons:
1) You need the fastest response time possible from storage for your application.
2) iSCSI payloads can be very small, especially SCSI control commands (CDBs).

Now, the SCSI control/query commands (CDBs) involved in enumerating the disks during the online action have a tiny payload (10 bytes).

Looking at the Ethernet trace, the cluster disk driver performs 10-byte SCSI read commands (LUN inquiry, read capacity, etc.).  It does this sequentially, at least twice for each disk involved in the virtual machine.

With the default TCP ACK time of 200ms, each SCSI read command issued by the cluster node carries a payload of 10 bytes (66 bytes on the wire).  The command does not fill a TCP segment, so while the TCP payload is sent to the storage controller and the controller responds, the node holds back its ACK until a segment on that NIC is filled or the maximum ACK time of 200ms is reached.

So, let’s say I am the cluster disk driver and want to read LUN metadata as part of onlining a cluster disk…
…I send the command via the iSCSI Initiator, the storage controller responds… wait… wait… wait… the delayed-ACK timer fires after a maximum of 200ms… only then is my ACK sent to storage… and the TCP transmission completes and the data is returned to the cluster disk driver process.

This means that for each SCSI command we send to storage (target and target LUN), we will typically end up waiting on the 200ms timer!  This happens regardless of how busy SCSI data I/O is, because the algorithm operates per socket (per Winsock connection), not per device.

In a Windows iSCSI cluster this issue really comes to light, because the cluster operates on (controls/validates) its resources in a sequential manner.
So, for say 65 LUNs, each LUN takes significantly longer to validate/control, because the delay from the 200ms ACK timer happens multiple times per LUN.
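
As a back-of-the-envelope illustration (the per-LUN command count here is my assumption, purely to show the scale): if onlining each LUN involves, say, six small command/response round-trips and each one stalls for the full 200ms timer, that is about 1.2 seconds per LUN, or roughly 78 seconds of pure ACK-timer waiting across 65 LUNs, before any real work is counted.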

Storage response times should be in the sub-10ms range, not 200ms 🙂

This is also why I did not see this issue in Fibre Channel environments.

So, while hardcoding TcpAckFrequency to 1 per NIC helps, it may have adverse performance implications in terms of wire and storage port congestion, since every received segment is now acknowledged individually.

But… for iSCSI networks this should not really be a concern, because by best practice you should have isolated iSCSI networks with minimal hops between host and storage.

 

The real fix, I believe, is to use the TCP_NODELAY socket option for the iSCSI Initiator’s connections.  This bypasses the Nagle algorithm completely for that socket, so small sends are never held back waiting on acknowledgements – the SCSI command will fire onto the wire immediately.
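
For illustration only: the Microsoft iSCSI Initiator runs in kernel mode, so the fix itself is Microsoft’s to make, but this small user-mode Winsock sketch shows what the TCP_NODELAY option does for an ordinary socket (the target address is a made-up example):

#include <winsock2.h>
#include <ws2tcpip.h>
#include <string.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    /* Disable Nagle's algorithm on this socket: small sends (like a
       10-byte SCSI CDB) go on the wire immediately, instead of being
       coalesced while waiting for earlier segments to be ACKed. */
    BOOL noDelay = TRUE;
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
               (const char *)&noDelay, sizeof(noDelay));

    struct sockaddr_in target;
    memset(&target, 0, sizeof(target));
    target.sin_family = AF_INET;
    target.sin_port = htons(3260);                /* standard iSCSI target port */
    inet_pton(AF_INET, "192.168.1.50", &target.sin_addr);  /* hypothetical portal */

    if (connect(s, (struct sockaddr *)&target, sizeof(target)) == 0)
    {
        char cdb[10] = {0};            /* stand-in for a tiny command payload */
        send(s, cdb, sizeof(cdb), 0);  /* sent immediately; no Nagle delay */
    }

    closesocket(s);
    WSACleanup();
    return 0;
}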

I have a design change request in for Microsoft to consider this as the way forward.  We probably won’t see it until Windows 8, but hey!

How to set the TCPAckFrequency on your iSCSI NICs

Regedit – backup your registry – before proceeding 🙂

Subkey: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<Interface GUID>
Entry: TcpAckFrequency
Value Type: REG_DWORD, number
Valid Range: 0-255
Set to 1.

Do this only for your iSCSI NICs, unless directed otherwise.
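
If you prefer to script it, the equivalent reg.exe command looks like this (the interface GUID shown is a placeholder for your actual iSCSI NIC’s GUID, and a reboot is typically needed for the change to take effect):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{YOUR-ISCSI-NIC-GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f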

Hope this helps

James.


EMC Live! WebCast Series – SharePoint

March 4, 2010

Hi Folks

Back again – it’s been a while, aye, but I have been working on some exciting stuff, which I will share with you in due course.

So today, if anyone has an hour to kill, Eyal and I are presenting an EMC webcast (part 1 of a 4-part series) on SharePoint, touching on what’s coming in SharePoint 2010.

http://info.emc.com/mk/get/DBM6473-4099_raf_lp?reg_src=JamesB

 

Today’s session is:

Thursday, March 4, 2010 – 12 pm PT / 3 pm ET
Learn how to design your SharePoint infrastructure to ensure optimal performance and scalability, as well as leverage the benefits of virtualization.

Part 2 is Dave from our SourceOne engineering wing

Thursday, March 11, 2010 – 12 pm PT / 3 pm ET
Find out how to mitigate risk, reduce costs, and improve SharePoint performance with EMC SourceOne.

Part 3 is Eyal and I again

Thursday, March 18, 2010 – 12 pm PT / 3 pm ET
Learn how to design, deploy, and manage your SharePoint infrastructure to ensure availability and rapid recovery, as well as understand what options are available—from native SQL Server functionality to array-based replication.

Part 4 is another Dave from another engineering wing, EMC Documentum
 
Enhancing SharePoint to Meet Your Information Management Needs

Thursday, March 25, 2010 – 12 pm PT / 3 pm ET
Discover how EMC Documentum integrates SharePoint into your broader information infrastructure, enabling you to cut operational costs and rein in server sprawl.

Hope to see ye there!

But I will share the content with ye later anyways 🙂

Thanks

James.


Poll: What SQL recovery model do YOU use for your SharePoint databases?

November 16, 2009

I’d be very interested to know what SQL recovery model YOU use with your SharePoint databases.

The poll is anonymous, so feel free to be honest!

If you use multiple recovery models, specific to the SharePoint database type, select each option that applies.

Cheers!

James.


SharePoint and SQL Databases

November 15, 2009

Hey folks,

This week I am presenting SharePoint and Hyper-V information to both EMC and Microsoft personnel at the Microsoft campus, Building 33 in Redmond, WA.  Strangely, it’s not raining!
I will be covering topics such as best practices, Hyper-V virtualization, backup and recovery, and DR.  I hope to share these presentations with you once the conference is over, so stay tuned.

On to the real topic…SharePoint & the proliferation of SQL databases.

SharePoint’s mainstay of information is in the form of SQL databases.

In a typical SharePoint SQL Server instance, I would categorize these databases into the following four layers:

===SQL system databases   (created when SQL is installed)
           -Master, Model, MSDB, TempDB

===SharePoint configuration databases   (created when SharePoint is installed)
           -SP_Config, etc.

===SharePoint content databases   (created at the end of SP install, portal & content creation)
           -WSS_Content_*   (Portal)
           -SharePoint_AdminContent_*   (Central Admin)
           -User-defined content databases   (e.g. ContentDB01, 02, 03)

===SharePoint Shared Services Provider databases   (created with SSP & application configuration)
           -SharedServices_DB   (SSP configuration database)
           -Shared Services application databases (Search)
                 -SharedServices_Search_DB   (the Office Search “OSearch” database)
                 -WSS_Search_{hostname}   (WSS SPSearch DB, one per host)

You need to follow standard SQL best practices, including storage best practices, to ensure good SharePoint performance, granular backup and recovery, and efficient disaster recovery.  Agreed.
(I will go into more best practices for SharePoint SQL storage in a separate post; let’s stay with this for now.)

 

SharePoint does not allow the user to specify where the SQL database data and log files should reside, so the default database locations are used.  These default data and log file locations are part of the SQL instance configuration.

These are recorded in the registry, specific to the SQL instance, e.g.
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.$$INSTANCE$$\MSSQLServer\
             \DefaultData               (Default database data file location)
             \DefaultLog                (Default database log file location)
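
To check these from a command prompt (substitute your instance name for $$INSTANCE$$, as above):

reg query "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.$$INSTANCE$$\MSSQLServer" /v DefaultData
reg query "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.$$INSTANCE$$\MSSQLServer" /v DefaultLog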

Easier still is to use SQL Server Management Studio:
   –  Right-click the SQL instance, Properties, Database Settings -> Database Default Locations.

Today, we have two choices in ensuring that SharePoint SQL databases are in the right locations…

1) Change the default SQL data file location prior to the SharePoint configuration task (e.g. create a SSP)
      or
2) After the SharePoint configuration task, bring down SharePoint hosts, detach, relocate and re-attach the SQL databases.

I prefer #1 myself 🙂

So, here is the recommended sequence of steps to take.

Recommended sequence

1) Install SQL with advanced options
  -ensure that the master, model, and msdb locations are correctly set
  -ensure that tempdb is on different LUNs, ideally split across multiple data files

2) Change the default database file locations to your SP configuration volume (this step, and the similar ones below, can be scripted – see the sketch after this list)
          -then install SharePoint.

3) Change the default database file locations to your “basic content” (or SP configuration) volume
         -then create your SharePoint portal(s).

4) Change the default database file locations to your SSP & Search configuration volume
         -then create your SharePoint SSP and add SPSearch roles to hosts.

5) Change the default database file locations to your SSP Search database volumes
         -then create your SharePoint SSP Search application and associate a content source.
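
For those “change the default locations” steps, rather than clicking through Management Studio each time, the change can be scripted with the undocumented (but widely used) xp_instance_regwrite extended stored procedure.  A sketch, with hypothetical drive paths – adjust to your own layout:

USE master;
GO
-- Point new database data files at the (hypothetical) SP configuration volume
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer',
     N'DefaultData', REG_SZ, N'E:\SPConfig\Data';
GO
-- And the log files at a separate (hypothetical) volume
EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer',
     N'DefaultLog', REG_SZ, N'F:\SPConfig\Logs';
GO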

 

Now, user-level content databases are trickier…
You don’t want to have to follow this procedure every time, as many content databases will be created over time.

My recommendation would be:

1) logged in as the SharePoint system account in SQL, manually create the content databases in the right storage locations (see the T-SQL sketch below)
2) then use either Central Admin or STSADM to attach the existing SQL content database

  a) Central Admin way
     – Central Administration > Application Management > Content Databases
     – specify the name of the existing SQL content database

     or

  b) STSADM way
     – stsadm -o addcontentdb -url (URL) -databasename (ContentDB name) -databaseserver (SQL name)

     Example:
     stsadm -o addcontentdb -url http://portal.sps.com/site01 -databasename ContentDB01 -databaseserver SQL1

You should not need to specify the username/password as you will use a trusted connection within your domain.
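
For step 1, a minimal T-SQL sketch (the database name, logical file names, and drive paths here are hypothetical – match them to your own storage layout):

-- Create the content database directly on the correct LUNs
CREATE DATABASE ContentDB01
ON PRIMARY
   (NAME = ContentDB01_data,
    FILENAME = N'G:\SPContent\Data\ContentDB01.mdf')
LOG ON
   (NAME = ContentDB01_log,
    FILENAME = N'H:\SPContent\Logs\ContentDB01.ldf');
GO

Then attach it with Central Admin or stsadm as shown above, and SharePoint will use the files exactly where you placed them.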

I am putting in an enhancement request to Microsoft to allow SharePoint admins to specify the database file locations from Central Admin/STSADM/PowerShell in future.

Every so often, especially with dispersed power users (capable of creating content databases), a full audit of SharePoint database files should be carried out.  It is vital to ensure that all databases are protected.
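
A quick way to run such an audit (SQL 2005 and later) is to list where every database file on the instance actually lives:

-- List every database file and its physical location on this instance
SELECT d.name AS database_name,
       mf.type_desc,
       mf.physical_name
FROM sys.master_files AS mf
JOIN sys.databases AS d
  ON d.database_id = mf.database_id
ORDER BY d.name;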

Hope this helps people
James.


Fancy having fully-automated site Disaster Recovery?

November 4, 2009

A critical application…
…nearly as much $$$ spent on your DR site as your Production site…
…hours of time spent on defining & refining your disaster recovery procedure…
…never mind the training & updates from testing…
…varying bandwidth between sites, sometimes not enough…
…want to be able to recover your SQL server back to 10:22:16…
…and that was 2 days ago…
…need that data back within minutes…

Well you can…

EMC RecoverPoint, EMC’s CDP (Continuous Data Protection) product, provides near-instantaneous roll-back capability.

The majority of operating systems and applications are supported.  Data reduction* and WAN bandwidth compression – native.  Non-EMC storage arrays – supported.  Long distances – how about 2000km? – supported.

Combine RecoverPoint and EMC Cluster Enabler together (called RP/CE for short) and the solution gives you just this.

RP/CE is what is called a geographically dispersed clustering solution, but it is a lot more…
…RP/CE allows one or many clustered applications or Hyper-V VMs to be failed over in a minimal amount of time to disaster recovery cluster node(s) – in my Proven Solution, try 3 production / 3 DR nodes.

I could even run some VMs on the DR cluster nodes (if your network supports this, of course) and some on production.  RecoverPoint/CE doesn’t mind; it natively supports bi-directional replication within the same RecoverPoint installation.
EMC understands that nowadays DR sites are too costly to just leave idle.

The slick part of RecoverPoint/CE is that once operational, all the user needs to know is how to use normal Microsoft clustering and cluster administrator console.  RP/CE adheres to all Microsoft failover clustering requirements.  It is installed as a clustered resource and is added to each cluster group which needs RP/CE’s protection.

Say you have a 2-node (1 active / 1 passive) Hyper-V cluster running your virtualized SharePoint farm.  Down goes your production site.  Within a few minutes, all your SharePoint virtual machines are up and running on the DR side again, with the latest image of your data.  It’s that simple.

I was at the EMC booth at SharePoint Conference 2009 (SPC09) in Las Vegas in October showcasing this, and I must say… visitors were very impressed… most especially the folk who have been through the pain of recent DR planning.

* As a classic example: in something like an OLTP environment, if an 8k SQL data page is changed by, say, 200 bytes, the entire 64k block is written back down to the filesystem.  Without RecoverPoint’s data reduction, all 64k of data is shipped across the wire to the DR site.  With data reduction, only the 200 changed bytes plus some checksum data are sent across the wire – and compressed!  Clever.

For a recorded demonstration of the Proven Solution I am working on, please see below:

In this demo, I would like to show you the power of EMC RecoverPoint and Cluster Enabler (RP/CE) in providing fully automated disaster recovery in your environment.  In this use case, a busy enterprise SharePoint farm hosting 240,000+ active users experiences a full site disaster, and RP/CE automates disaster recovery of the farm in minutes.

The environment consists of a 6-node Hyper-V cluster (3 active / 3 passive) using iSCSI connectivity to an EMC CLARiiON CX4-240 storage array.

I will share more information on this solution as it evolves.

Some more doc resources 

EMC RecoverPoint/Cluster Enabler – a detailed review
Disaster Recovery for Windows using EMC RecoverPoint/CE
EMC RecoverPoint/SE for CLARiiON Cx4
DR in a geographically dispersed cross-site virtual environment

I would love to get some feedback on what people think of something like this – not necessarily this EMC solution, but geo-clustering in general…
If anyone would like some serious detail on how RecoverPoint/CE works, I can gladly provide it as a blog post.

Thanks, James.