17 July, 2006

Risk Structure


Risk = Impact * Probability

Probability = (Threats – Preventative Controls) / Mitigating Controls
So what are the real pieces of this overused, debunked, but never disproved equation?

Impact is Significance modified by the different forms of Immediacy

In the entire risk structure, the only thing that Digital Security as an organization can reliably affect is the Controls/Management aspect. All other aspects of the process are defined outside the Security team: we can identify them and determine which are in context for us, but we cannot change them. Within Probability, the threats are controlled by the originators of those threats. Within Impact, immediacy and significance are either inherent or defined by business needs.
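As a toy illustration of how the pieces fit together (the numeric scales below are invented for the sketch, not part of any standard), the structure reads directly as code:

    # Toy sketch of the risk structure above. The scales (arbitrary threat
    # units, multiplicative controls) are invented for illustration only.

    def probability(threats: float, preventative: float, mitigating: float) -> float:
        """Probability = (Threats - Preventative Controls) / Mitigating Controls."""
        residual = max(threats - preventative, 0.0)  # controls cannot drive threats below zero
        return residual / mitigating if mitigating else residual

    def impact(significance: float, immediacy: float) -> float:
        """Impact is Significance modified by Immediacy."""
        return significance * immediacy

    def risk(significance, immediacy, threats, preventative, mitigating):
        return impact(significance, immediacy) * probability(threats, preventative, mitigating)

    # Example: a highly significant asset, moderate threats, decent controls.
    print(risk(significance=8, immediacy=0.9, threats=6, preventative=2, mitigating=4))  # 7.2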

Context

Context – the defined structure that the other portions of risk are compared against and dependent upon.

It defines the scale of review and management of controls.

In short - what are the scope and limitations of the risk management effort? (A data-structure sketch of these questions follows the list below.)


•What is being protected/assessed?
–Group, Segment, Business Unit, Project, Plant, Pipeline, …
•What are the applicable impact mechanisms?
–Confidentiality, Integrity, Availability, Environmental, Safety, Liability…
•What are the applicable threat sources?
–Eve the evil hacker, Joe the employee, Virus, Mafia, Government, Competition…
•What are the acceptable Controls/Management functions?
–How much cost is acceptable?
–How much admin is acceptable?
–Who has Authority?
–Who has Responsibility?
–How does it overlap with other controls? e.g. Failsafes, FC&A, Physical Security
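One way to pin these questions down is to require each assessment to fill in a structure like the following sketch; the field names mirror the list above, while the types and example values are invented:

    from dataclasses import dataclass, field

    # Sketch only: the fields mirror the scoping questions above; the types
    # and defaults are invented, not a standard risk-context schema.

    @dataclass
    class RiskContext:
        protected_scope: str                                          # group, segment, business unit, project, ...
        impact_mechanisms: list[str] = field(default_factory=list)    # confidentiality, integrity, availability, ...
        threat_sources: list[str] = field(default_factory=list)       # hacker, employee, virus, ...
        acceptable_cost: float = 0.0                                   # budget ceiling for controls
        acceptable_admin_hours: float = 0.0                            # ongoing administration ceiling
        authority: str = ""                                            # who can approve changes
        responsibility: str = ""                                       # who answers for failures
        overlapping_controls: list[str] = field(default_factory=list)  # failsafes, FC&A, physical security

    ctx = RiskContext(
        protected_scope="Plant PCN",
        impact_mechanisms=["Availability", "Safety"],
        threat_sources=["Virus", "Joe the employee"],
        authority="Plant engineering",
        responsibility="Digital Security",
    )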

14 July, 2006

Layers

In-Depth Protection with Multi-Layered System Defense – Jim C

Introduction

From an infrastructure standpoint, most organizations rely on two key defenses to protect their essential systems: solid edge protection using firewalls, and up-to-date patching and antivirus. More advanced organizations have developed elaborate and comprehensive procedural elements to optimize the effectiveness of these protections, and internal firewalls have been implemented to further protect critical assets such as PCNs and key datacenters. This protection is essential, but there are inherent weaknesses in both firewalls and standard patching/AV signature deployment mechanisms that prevent even the most comprehensive programs from being totally effective. Adequate protection against threats in the current environment requires both in-depth network protections and multi-layered system protections.

Vulnerabilities

Firewalls cannot easily block traffic destined for legitimate functions (TCP or UDP ports), and most standard deployments are unable to effectively analyze packet content to determine its probable impact on the protected end systems. Exploits and subsequent worms have proliferated by taking advantage of this weakness, such as Sasser (the LSASS buffer overflow, port 445) and Nachi (the RPC DCOM buffer overflow, port 135 or 445). The vulnerabilities underlying these exploits expose systems both to worms and to hacking that is difficult to detect without an IDS. Since the vulnerable ports are needed for normal business transactions, it is impossible to block them. Even firewall systems that provide comprehensive packet inspection are often only point solutions and cannot dynamically adjust the network to an attack. Identifying the problem is only one piece of the solution; it is still necessary to stop the subsequent attack.
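A toy contrast shows why port filtering alone falls short: traffic to a required service port must pass a port rule, while content inspection could still reject it. The "signature" below is a stand-in invented for illustration, not a real IDS rule:

    # Toy contrast between port filtering and content inspection.
    # The exploit signature is invented; real IDS/IPS rules are far more involved.

    ALLOWED_PORTS = {135, 445}                # legitimate Windows RPC/SMB service ports
    EXPLOIT_SIGNATURE = b"\x90\x90\x90\x90"   # stand-in for a buffer-overflow pattern

    def port_filter(dst_port: int) -> bool:
        """A plain firewall rule: anything addressed to a business port gets through."""
        return dst_port in ALLOWED_PORTS

    def content_inspect(dst_port: int, payload: bytes) -> bool:
        """Inspection can reject traffic to a legitimate port if the payload matches."""
        return port_filter(dst_port) and EXPLOIT_SIGNATURE not in payload

    # An exploit aimed at a required service port passes the port filter...
    print(port_filter(445))                                  # True
    # ...but not the (toy) content inspection.
    print(content_inspect(445, b"\x41" * 32 + b"\x90" * 4))  # False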

Comprehensive patching and antivirus signature updates are the most effective way of dealing with the system-level vulnerabilities that a firewall is unable to address. In a perfect environment all machines would be 100% patched with the most recent fixes and therefore not vulnerable to remote exploitation. In the real world, 100% patching is impossible. Comprehensive AV signature coverage is more achievable, but 100% coverage within the few hours it might take a malicious worm to spread after an exploit is revealed is still a near impossibility. This is due to several factors. Central IT very rarely has direct access to all the machines that connect to the organization's network within 2 to 4 hours of a vulnerability announcement: some systems are traveling, some are behind firewalls or other protection mechanisms, and many are not accessible to a central IT organization at all. Inaccessibility might be due to the machine being owned by another entity, such as a contracting organization, or to rigorous change control requirements and an inability (or unwillingness) to allow changes, as on many PCN systems. Even when central IT has direct control of the systems, it often takes days or weeks to deploy a patch, and newer systems may not be appropriately configured to receive it. These elements combine to leave a coverage deficit of between 5% of systems (in very well administered patching environments) and more than 25% (in less well administered ones).

Risks

When combined, these fundamental flaws result in a significant breakdown of the company's overall information security, and the resulting risks are substantial.

On the confidentiality side, the aggregate risk of even 5% of potentially accessible machines being exploitable gives external hackers the ability to "daisy chain" attacked systems together and effectively bypass protections such as firewalls. A significant number of machines that can be compromised at root level and that also contain databases (as with the LSASS exploit and Sasser) means that this data is readily accessible with minimal effort. Gathering system information from these machines allows other machines (not vulnerable to the original exploit) to be compromised as well. The net effect is that, without other protection mechanisms in place, most or even all of an organization's data is open to outside entities willing to make the effort of retrieving it.

The availability impact of these aggregate risks comes primarily from system or network loss caused by worms and other viruses. In many organizations this is easily measured by analyzing the impact of previous infections, modified by two factors. The first is that the window between vulnerability announcement and worm release has been shrinking, so the risk of an occurrence arriving before anyone can respond is increasing. The second is that few if any of the worms of the last several years have carried an intentional payload, which means we have not seen anything approaching a worst-case scenario. A reasonable scenario would be the unavailability of 10% to 50% of the entire IT infrastructure for up to one week, with possible complete unavailability for more than one day and indefinite loss of much of the backup-gap data. Information on systems without backup processes (such as laptops and user desktops) could be irrevocably lost for a significant percentage of systems on the network.

Integrity risk from the identified systemic flaws is slightly less catastrophic than the previous two risks. It comes primarily from intentional data manipulation by an undesired party via the mechanisms identified in the confidentiality section. Presumably this would be mitigated by working business processes that identify problems before they reach material levels, but in organizations with a high level of non-compliance on patching, this type of manipulation could be hidden. Organizations that rely on trust and non-automated detective controls rather than systemic segregation of duties are more exposed to this risk.

Based on the aggregate of these identified risks, it is conservative to place up to 2% of an organization's yearly output at risk, with a fairly high likelihood of occurrence. Many organizations have had significant identified outages due to worms and hacking events in the last year; several have lost more than a full work week for the entire organization, many have incurred reputation damage, and a few have been subject to regulatory sanctions.
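Taking those figures at face value, the exposure is simple arithmetic; the yearly output and likelihood numbers below are placeholders, not from the text:

    # Back-of-the-envelope exposure using the figures above.
    yearly_output = 500_000_000   # placeholder: $500M of yearly output
    at_risk_fraction = 0.02       # up to 2% of yearly output at risk
    likelihood = 0.5              # "fairly high likelihood" - assumed value

    expected_annual_loss = yearly_output * at_risk_fraction * likelihood
    print(f"Expected annual loss: ${expected_annual_loss:,.0f}")  # $5,000,000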

Frequent, recurring virus outbreaks highlight fundamental flaws that might also serve as avenues of exploitation for other threats, and they indicate a higher level of risk than would exist in an organization without such issues.

Solutions

To cost-effectively protect against the threats prevalent in today's networked environment, it is necessary to defend at multiple locations in the network (in-depth protection) as well as at multiple layers on a single system (layered defense). In-depth protection relies on existing strategies such as firewalling and connection authentication as well as newer mechanisms such as Network Intrusion Prevention Systems (NIPS). Likewise, protecting systems at multiple layers combines older protections such as access control, patching, and antivirus with (relatively) newer strategies such as centrally controlled host-based firewalling, memory protection, and behavioral restrictions, which can loosely be grouped together as Host Intrusion Prevention Systems (HIPS).

By placing NIPS at key locations it is possible to segregate potential weaknesses in the architecture and to ensure that worst-case infections are contained. With careful location selection, the rule of two can dramatically decrease overall exposure at relatively low cost: simply put, a properly located NIPS can reduce the number of systems vulnerable to a given exposure by up to half (usually substantially less) of the total existing machines. The actual number will be lower than half for each NIPS due to unequal distribution of systems within the overall network and, in most cases, multiple access paths.
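A quick sketch of how the rule of two compounds across placements; the per-unit reduction fractions are assumptions, and as noted each is usually substantially less than half:

    # Sketch of the "rule of two" for NIPS placement. Each properly located
    # unit cuts the exposed population by at most half; the per-unit fractions
    # below are assumptions for illustration.

    def exposed_after_nips(total_systems: int, reduction_per_nips: list[float]) -> float:
        """Remaining exposed systems after each NIPS trims its share."""
        exposed = float(total_systems)
        for fraction_removed in reduction_per_nips:
            exposed *= (1.0 - min(fraction_removed, 0.5))  # at most half per unit
        return exposed

    # Three NIPS, each removing less than the ideal half due to uneven
    # distribution of systems and multiple access paths:
    print(exposed_after_nips(10_000, [0.5, 0.4, 0.3]))  # 2100.0 systems still exposed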

Unfortunately, reducing the risk below the level of materiality often requires protecting individual systems or small clusters of them, which is where HIPS is more cost effective. By comprehensively protecting key systems, such as financial application servers and key process control systems, with HIPS, and placing them on highly controlled and fully protected networks (where every system on the subnet has HIPS installed), total potential loss is limited to outages from network congestion, constrained by the NIPS infrastructure to small geographic or business regions. Point solutions can be flexible and site specific based on needs while still providing comprehensive protection.

Summary

Without action, an organization is at risk of significant and material loss from catastrophic virus infection and/or undetected malicious activity. Existing mechanisms for dealing with these threats are helpful and should be supported and expanded, but inherent weaknesses keep them from effectively mitigating the total risk. Most organizations have already incurred repeated losses from this exposure. Architectures and processes exist that can effectively mitigate these risks; organizations should investigate them, determine the design most appropriate to their environment, and, if the cost is commensurate with the risk of potential losses, implement the systems.

13 July, 2006

A Vision of an Ideal Process Security Environment

What the Operator should have to do
  • Install preconfigured networking hardware
  • Install Primary DCS server
  • Install USB device provided by vendor
    Follow the wizard to generate keys
    Lock the USB device away just in case
    Follow the wizard to identify networking hardware and other key settings/trusts
    If desired, integrate with the MOC process/software for the desired level of control
  • Physically install new PLCs
  • Go to the configuration screen and accept the PLCs individually
    Discover devices on legacy PCN and accept them into the system
  • Operate/engineer as normal

PLCs/Controllers
PLCs have default communication access mechanisms to ensure that they receive commands from the proper locations.

  • Asymmetric key pair (very likely hard to administer, but still ideal; see the signing sketch after this list)
    Installed in the factory
    Public key accessible to the purchaser, probably within the historian or DCS server via licensing
    Keys can be changed and updated via the appropriate DCS server on initial configuration and afterwards as needed
    This is used as an authentication mechanism to ensure that PLCs do not communicate with any other systems. They use SSH or another tunnel to communicate with each other and with the DCS servers so that they are not easily subject to redirect attacks.
  • A host-level firewall is configured to allow the PLC to receive and send communications only in specific ways; all other traffic is dropped without response. This does not need to be an actual firewall: it could be done with a customized stack that only allows specific communications.
  • Integrated SNMP (v3), syslog, or similar capabilities for logging and alerts, configurable via an authenticated trusted source and preconfigured to supply security data to a remote point
  • Logs tie changes to the authenticating source and authority
  • Failsafe settings can (but don’t have to) require local physical action to change
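A minimal sketch of the command-authentication idea above, using the third-party Python cryptography package; the command format and key handling are invented for illustration, not any vendor's mechanism:

    # Sketch of PLC command authentication with a factory-installed key pair.
    from cryptography.hazmat.primitives.asymmetric import ed25519
    from cryptography.exceptions import InvalidSignature

    # Factory provisioning: the private key stays on the DCS server, the
    # public key ships with (or is licensed to) the PLC.
    server_private = ed25519.Ed25519PrivateKey.generate()
    server_public = server_private.public_key()

    # DCS server signs a command before sending it down the tunnel.
    command = b"SET valve_7 OPEN"   # invented command format
    signature = server_private.sign(command)

    # PLC side: accept the command only if the signature verifies against
    # the trusted public key; anything else is dropped.
    try:
        server_public.verify(signature, command)
        print("command accepted")
    except InvalidSignature:
        print("command dropped")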

DCS servers

  • DCS servers (whether they are historians or more) have multiple layers of protection, all of which have configurations approved (and specifically defined) by the applicable vendors.
    A host based firewall (HFW)
    Integrated communication authentication capabilities tied to the key structure used in the PLC’s and elsewhere in the architecture.
    Integrated signature based IPS capability in the HFW with signatures driven from a trusted authenticated source.
  • Approved AV software with specific recommendations on DAT update mechanisms that are consistent with the specific AV vendor's methodologies
  • Behavior based IPS with DCS vendor approved configuration
  • Memory Protection/Control
  • Integrated management architecture
    Release management capabilities for servers, all software on them and for associated Controllers
    MOC (management of change) mechanisms with coordinated approval levels for changes on the server, for software and for controllers
    Might (should?) be integrated with AV and IPS update architecture
  • Primary/Secondary DCS security servers
    The primary DCS server serves as the center of the key architecture for the PLCs and as a security aggregation point for interfacing with external security and authentication (a key-endorsement sketch follows this list)
    Security functions should be on the normal central DCS server
    Capable of redundant configurations
  • Defined trust structure that will allow integration
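A companion sketch of the key architecture centered on the primary DCS security server: the root key endorses each accepted PLC key, so other components can verify membership without per-device trust lists. The endorsement format is an assumption, again using the cryptography package:

    # Sketch of the primary DCS security server as the root of the key structure.
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import ed25519

    root_key = ed25519.Ed25519PrivateKey.generate()   # organization's top-level key

    plc_key = ed25519.Ed25519PrivateKey.generate()    # generated during acceptance
    plc_public_bytes = plc_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )

    endorsement = root_key.sign(plc_public_bytes)     # root vouches for the PLC key

    # Any DCS component holding the root public key can now verify the PLC:
    root_key.public_key().verify(endorsement, plc_public_bytes)
    print("PLC key endorsed by primary DCS security server")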

Network

The network is divided into several segments.

  • Firewall (or firewall IOS) controls access to all segments
    Stateful packet inspection
    Signature based NIPS capability
    Secure Remote Monitoring and update capability
    Dynamic redundancy capability
    Power
    Devices (HA, VRRP, HSRP); load sharing not strictly necessary
    Availability-biased failure behavior for interfaces
    Industrialized/static safe
    DCS vendor provides specific configurations for integration with their security architecture
  • NAC (or similar mechanism) used to control access to each segment (a toy decision sketch follows the network list below)
    NAC splits the segments into two separate VLANs: Trusted and Untrusted
    The Trusted VLAN is home to configured, authenticated systems (using the key structure to provide automated authentication)
    The Untrusted VLAN has all traffic routed to an initial configuration DCS security server
    (Optional) Default untrusted network for devices that connect without even a manufacturer's key or similar capability but still have direct control functionality
  • PIN Network (PCN DMZ)
    Serves as home to the historians and other DCS servers that have open-loop controlling functions or serve as data aggregation points for external feeds and monitoring
    Provides neutral zone between vendors
    Provides interface capability to control functions
  • PCN Network
    Home to PLCs and DCS servers with closed-loop controlling functions
    Authentication for communication via NAC, with the key architecture providing access authentication
    NAC splits the PCN into two separate VLANs: Trusted and Untrusted PCN
    The Trusted VLAN is home to configured PLCs and systems
    The Untrusted PCN has all traffic routed to an initial configuration DCS server
    (Optional) Default untrusted network for devices that connect without even a manufacturer's key or similar capability but still have direct control functionality
    Separate PCNs possible for Redline (highly critical or safety-essential) systems
  • ESD Network
    Used as a protected network for Emergency Shutdown PLCs and associated servers/services
    Very tightly controlled access
    All changes logged, documented and tied to an engineering authority
    Home of the key fail safe mechanisms
  • (Optional) Monitoring Network
    Home of controllers that have monitoring-only capability and do not participate in closed-loop controlling functions
    Servers that provide outgoing data for troubleshooting and performance management
  • (Optional) Utility Network
    Home to support servers and systems that need integration with DCS systems but serve no actual control functionality
  • (Optional) Legacy Network
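To make the NAC split concrete, here is the toy VLAN-assignment decision referenced in the NAC item above; the VLAN numbers and the legacy MAC allowlist are invented:

    # Toy NAC decision for the trusted/untrusted VLAN split described above.
    TRUSTED_VLAN, UNTRUSTED_VLAN, LEGACY_VLAN = 100, 200, 300
    LEGACY_MACS = {"00:0c:29:aa:bb:cc"}   # pre-approved keyless legacy devices

    def assign_vlan(mac: str, presented_valid_key: bool) -> int:
        """Route authenticated devices to the trusted VLAN, keyless legacy
        devices to their controlled VLAN, and everything else to the
        untrusted VLAN (whose traffic goes to the configuration server)."""
        if presented_valid_key:
            return TRUSTED_VLAN
        if mac in LEGACY_MACS:
            return LEGACY_VLAN
        return UNTRUSTED_VLAN

    print(assign_vlan("00:0c:29:aa:bb:cc", presented_valid_key=False))  # 300
    print(assign_vlan("00:11:22:33:44:55", presented_valid_key=True))   # 100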

How it could work

The organization installs and configures the networking equipment in accordance with DCS vendor recommendations, leaving a legacy LAN (or LANs) for existing equipment. The primary DCS security server is installed and configured, with the organization providing (or generating) its top-level key pair and backing it up securely. Network authentication is configured to the server. New controllers are connected to the PCN or monitoring network; they try to authenticate to the network and either succeed based on preconfigured factory keys, or fail and are routed to a secure server that uses the vendor default key to tell them to update their key pairs to ones provided by the primary DCS security server. This could be automated, or the new devices could show up in an "unidentified" list that requires an operator to permit key distribution. Configured and identified controllers send/stream log data to the DCS security server along with their normal traffic. If a controller does not have the capability to handle a key, its MAC is used to assign it to a legacy PCN and allow future access from that separate, controlled VLAN. Controller software and possibly firmware updates are periodically checked and applied (after engineering-authority approval) from the primary DCS server. Trust relationships are strictly controlled and limited to information access in the default settings. All setting changes are logged, and can be configured to require a vote for permission from the system authority, with different levels of change capability for operators, administrators, and MOC approval.
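Restating that onboarding flow as code makes the decision points explicit; everything here (the device fields, function names, and messages) is invented for illustration, not a vendor API:

    # Sketch of the controller onboarding flow described above.
    from dataclasses import dataclass

    @dataclass
    class Controller:
        mac: str
        has_key_support: bool
        key_accepted: bool   # did the factory key authenticate to the network?

    def onboard(device: Controller) -> str:
        if not device.has_key_support:
            # Keyless legacy device: pin its MAC to the controlled legacy PCN.
            return f"legacy PCN (MAC {device.mac})"
        if device.key_accepted:
            # Factory key verified: join the trusted VLAN and start streaming logs.
            return "trusted VLAN; logging to DCS security server"
        # Key present but unknown: route to the configuration server, which uses
        # the vendor default key to issue site keys (optionally after an operator
        # approves the device from the "unidentified" list).
        return "untrusted VLAN; awaiting key distribution from primary DCS server"

    print(onboard(Controller(mac="00:0c:29:aa:bb:cc", has_key_support=False, key_accepted=False)))
    print(onboard(Controller(mac="00:11:22:33:44:55", has_key_support=True, key_accepted=True)))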