How to improve your IT. Part 10 – What Infrastructure risks are you carrying?
A series of posts on how to improve the performance of your IT
There is a saying that CIOs lose their jobs because of bad Infrastructure Managers. I add to this that the Infrastructure Managers are bad because they fail to understand the basics and that the devil is in the detail. I have managed many Infrastructure departments small and large, and whilst I can say that they are indeed a challenge, if you put the basics in place, they are easy to manage. IT Infrastructure is complex and critical to all IT operations, consisting mainly of Service Delivery, Installations, Maintenance, SOEs, Server and Desktop refresh strategies and Networks and Communications. Pay attention to the basic needs of these and things look after themselves.
The best practice standard is 99.9% server and network availability, together with a hardware fleet upgrade strategy driven by application capacity needs, response time objectives, systems capacity management requirements, hardware failure rates and fleet ageing. A systems software upgrade strategy and a desktop refresh strategy are also required.
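To put the 99.9% figure in perspective, the minimal sketch below converts an availability target into the downtime it permits. The function name and the 365-day year are assumptions made for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def allowed_downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

At 99.9%, this works out to about 8.76 hours of downtime per year, which is a useful number to hold against your actual outage records.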
Ask your Infrastructure Manager to conduct this Performance Assessment; think of it as a health check. It is the assessment I have always conducted in my early days as a CIO or Infrastructure Manager, as it quickly lets me understand the degree of risk I am carrying. Try it out for yourself.
Performance Assessment
1. Servers
The results of this Server hardware audit are also used to check on the accuracy and completeness of the IT Budget and provide input into the next IT Strategy (Risk analysis).
Actions
Record all of the following on a spreadsheet (by unit, or grouped by type, showing total numbers).
List all systems servers.
List all web servers.
List all applications servers.
List all e-mail servers.
List all back-up servers.
List all other server types in use.
List all server management software in use.
List all maintenance/support agreements in place.
Rate all of the above as high, medium or low (H, M, L) risk based on probability of failure.
Does the budget reflect all software licensing and maintenance costs for the above items?
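If the audit spreadsheet is exported as CSV, the totals by type and by risk rating can be produced with a few lines of code. This is only a sketch: the column names (`type`, `name`, `risk`) and the sample rows are hypothetical, not a prescribed layout.

```python
import csv
import io
from collections import Counter

# Hypothetical export of the audit spreadsheet; column names and rows are
# illustrative assumptions, not a prescribed layout.
SAMPLE_CSV = """type,name,risk
web,web-01,L
web,web-02,M
application,app-01,H
email,mail-01,M
backup,bkp-01,H
"""


def summarise(csv_text: str):
    """Count servers by type and by (H, M, L) risk rating."""
    by_type = Counter()
    by_risk = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_type[row["type"]] += 1
        by_risk[row["risk"].upper()] += 1
    return by_type, by_risk


if __name__ == "__main__":
    by_type, by_risk = summarise(SAMPLE_CSV)
    print("Total servers:", sum(by_type.values()))
    for server_type, count in sorted(by_type.items()):
        print(f"  {server_type}: {count}")
    for rating in ("H", "M", "L"):
        print(f"Risk {rating}: {by_risk[rating]}")
```

The same totals can then be checked against the line items in the IT Budget.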
Questions to ask
What server recovery processes exist?
What is the average failure rate of production servers (unplanned shutdowns)?
What is the average failure rate of production application servers (unusable to users)?
How is server resource utilisation managed (CPU, memory and disk space, both used and free)?
Is there a formal process for server physical and logical installation?
Is there a formal manual or automated process for server recovery?
When was the last test of restoration from a back-up completed?
How are back-ups confirmed as complete?
How often are full image restores used?
Is the reinstallation of systems and applications software manual or automated?
Is resource utilisation trending in place for critical systems?
Are all servers included in the depreciation schedule?
Are all servers covered by a maintenance agreement?
Is critical infrastructure covered by high priority maintenance agreements?
Has the infrastructure disaster recovery plan been tested?
Is there a server refresh strategy in place?
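For the resource utilisation and trending questions above, one simple approach is to fit a straight line to daily usage samples and project when a limit will be reached. The sketch below assumes one disk-usage percentage per day and a linear growth pattern; both are simplifying assumptions, and the function name is hypothetical.

```python
def days_until_full(used_pct_by_day: list[float], limit_pct: float = 100.0):
    """Fit a straight line to daily usage samples and project when the limit is reached.

    Returns the number of days after the last sample, or None if usage is
    flat or falling. Assumes one sample per day and roughly linear growth.
    """
    n = len(used_pct_by_day)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_pct_by_day) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_pct_by_day)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (limit_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    samples = [70.0, 71.0, 72.0, 73.0, 74.0]  # illustrative disk usage, % per day
    print("Projected days until full:", days_until_full(samples))
```

A projection like this is only as good as the samples behind it, so it is best treated as an early warning rather than a forecast.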
2. Other hardware
Actions
Record all of the following on a spreadsheet.
List all desktops (by unit or group by type).
List all routers (by unit or group by type).
List all switches (by unit or group by type).
Based on failure rates or equipment age, rate all of the above items as high, medium or low (H, M, L) risk.
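The rating step can be made repeatable with a small rule. The sketch below is one possible scheme, not a standard: the five-year age limit, the two-failures-a-year limit and the rule that both together mean high risk are all assumptions to be tuned to your own fleet.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    age_years: float
    failures_last_year: int


def rate_risk(device: Device, max_age_years: float = 5.0, max_failures: int = 2) -> str:
    """Return 'H', 'M' or 'L' using illustrative thresholds (assumptions, not a standard)."""
    over_age = device.age_years >= max_age_years
    failing = device.failures_last_year >= max_failures
    if over_age and failing:
        return "H"
    if over_age or failing:
        return "M"
    return "L"


if __name__ == "__main__":
    fleet = [
        Device("switch-a", age_years=6.0, failures_last_year=3),
        Device("router-b", age_years=2.0, failures_last_year=2),
        Device("desktop-c", age_years=1.0, failures_last_year=0),
    ]
    for device in fleet:
        print(device.name, rate_risk(device))
```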
Questions to ask
How effective are desktop service delivery and repair procedures?
How quickly can a router or a switch be replaced?
What are the router and switch failure rates?
Is critical infrastructure covered by high priority maintenance agreements?
Are patches up to date?
Is there a desktop refresh strategy in place?
3. DBMS
Questions to ask
How many staff are involved with database administration?
Are there user account and share management procedures in place?
What administration services are in place for each RDBMS in active use?
What daily housekeeping procedures are in use?
How is capacity managed?
How is performance analysis/tuning conducted?
Does a systems software upgrade strategy exist?
Who owns and manages software license management?
What vendor support arrangements are in place?
4. Naming standards
Hardware often has no formal naming standard. Instead, names of planets, mountains or similar are used, which is unprofessional and can lead to a variety of problems. The best practice standard is a server, router and switch naming standard consisting of ‘type’ (for servers: web, system, print, production, application), ‘location code’ and ‘incremental number’. A good naming standard makes it easy to deploy, identify and filter through hardware farms, especially when you may have hundreds or thousands of units deployed. There are important advantages to adopting a formal naming standard that scales as the population grows:
It speaks to the professionalism of the IT department.
New staff can quickly learn to identify hardware types.
Mistakes caused by selecting the wrong piece of hardware are far less likely.
Disaster management benefits when staff and third parties need to identify and prioritise the hardware recovery sequence.
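A naming standard built from type, location code and incremental number is easy to generate and validate automatically. The sketch below shows one way it might look. The type abbreviations, the hyphen separator, the three-letter location code and the three-digit number are all assumptions for illustration; only the three components themselves come from the standard described above.

```python
import re

# Server types as listed in the article; the abbreviations, hyphen separator,
# three-letter location code and three-digit number are illustrative assumptions.
TYPE_CODES = {"web": "web", "system": "sys", "print": "prt", "production": "prd", "application": "app"}
NAME_PATTERN = re.compile(r"^(?P<type>[a-z]{3})-(?P<location>[a-z]{3})-(?P<number>\d{3})$")


def build_name(server_type: str, location: str, number: int) -> str:
    """Build a name such as 'web-syd-007' from its three components."""
    code = TYPE_CODES[server_type]
    name = f"{code}-{location.lower()}-{number:03d}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} does not fit the naming standard")
    return name


def parse_name(name: str):
    """Return (type code, location, number), or None if the name is non-standard."""
    match = NAME_PATTERN.match(name)
    if match is None or match["type"] not in TYPE_CODES.values():
        return None
    return match["type"], match["location"], int(match["number"])


if __name__ == "__main__":
    print(build_name("web", "SYD", 7))
    print(parse_name("app-mel-012"))
    print(parse_name("jupiter"))  # a planet name does not parse
```

Running `parse_name` over an existing asset register is a quick way to see how much hardware would need renaming.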
5. Tools and utilities
The best practice standard is that all software tools and utilities are vendor supported, with an OS, systems software and applications upgrade path. Tools and utilities have a nasty habit of multiplying, especially when they are freely downloadable from the Web. Most technical and engineering staff have their own set of utilities for fixing problems, rather than a set of approved, vendor-supported products. A vendor-supported, shared set of utilities gives confidence that common and consistent outcomes will be produced.
Actions
Record all of the following on a spreadsheet.
List all software tools and utilities in use.
List all scripts in use.
List all SOEs in use.
Rate all of the above items as high, medium or low (H, M, L) risk based on being vendor or non-vendor supported.
Questions to ask
What tools/utility redundancies exist?
What script redundancies exist?
What tools, utilities and scripts can be removed?
Should there be a policy of not downloading products from the Web?
Is software distribution fully automated?
Are production and development SOEs isolated?
Are there redundant SOEs? What can be removed?
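One quick way to start answering the script redundancy question is to look for exact copies of the same file held in different places. The sketch below groups files by a hash of their contents. It assumes the scripts are collected under a single directory, and the function name is hypothetical. It finds only byte-for-byte duplicates, not scripts that merely do the same job.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root: str) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by more than one file.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding collected scripts.
    for group in find_duplicate_files("."):
        print("Identical files:", ", ".join(str(p) for p in group))
```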
Build an Infrastructure Scope of works
Collate all of the responses gathered into a task list.
Prepare a risk analysis for all hardware, software and DBMS.
Update asset registers.
Update budget depreciation amounts and other asset-related costs.
Review production servers with high failure rates (unplanned outages).
Review production application servers with high failure rates (unusable to users).
Establish server resource utilisation management (CPU, memory and disk space, both used and free).
Create a process for server recovery by type.
Create a hardware fleet upgrade strategy.
Create a systems software upgrade strategy.
Create a desktop refresh strategy.
Put in place database monitoring.
Determine further works required and scope out.
Break down the scope of works to task level, ready for loading into the change management project schedule.
Risk mitigation actions from risk assessment.
Fully automate software distribution.
Remove redundant tools, utilities and scripts.
Replace non-vendor supported products with vendor-supported products.
Train staff on supported products.
Standardise engineering toolsets.
Investigate standardising on a common naming convention for servers, routers and switches, and migrate to it over the next six months.
Determine further works required and scope out.
Break down the scope of works to task level, ready for loading into the change management project schedule.