Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Enhancement to outline path for network policies for all core components #1720

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

knobunc
Copy link
Contributor

@knobunc knobunc commented Nov 26, 2024

This enhancement outlines how we will add network policies to all core components in OpenShift. It also outlines how it will eventually become enforcing, and how we will test compliance.

@openshift-ci openshift-ci bot requested review from danwinship and trozet November 26, 2024 17:38
@knobunc knobunc changed the title Enhancement to outline path for network policies for all core components WIP Enhancement to outline path for network policies for all core components Nov 26, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
Comment on lines 47 to 49
- As an administrator, I need to be able to override specific
OpenShift policies to be more restrictive so that I can satisfy my
security department
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part, this should not be necessary. If a given restriction is safe to apply, then it should be part of the default policy. And in that case, anything "more restrictive" would be something that would cause cluster functionality to break in some way. IMO we shouldn't allow admins to break functionality they aren't using, we should allow them to disable that functionality instead (and then they don't need to add NetworkPolicies restricting its use).

One specific case I can think of where this might apply is admission control webhooks / validation webhooks. The apiserver needs to reach out to any namespace that contains such a webhook, so by default, we'd have to allow the apiserver egress access to all namespaces (or at least, all "system" namespaces?). Administrators might want to disable that. But should that be expressed as "kube-apiserver can reach all namespaces by default but administrators can create additional NetworkPolicies to restrict that", or should we instead have kube-apiserver-operator just automatically add NPs for all and only the namespaces that contain webhooks? Or should we have a rule like "webhooks can only exist in namespaces with the label foo"?

(Also, FTR, there is no way to implement "kube-apiserver can reach all namespaces by default but administrators can create additional NetworkPolicies to restrict that" given the semantics of NetworkPolicy; we would have to say that administrators are allowed to delete the "kube-apiserver can reach all namespaces" NetworkPolicy.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want administrators to create AdminNetworkPolicy to express their exceptions (where they can deny).

Ultimately, I would love for operators to have a way to have config that allows them to create policy to restrict things like webhooks. They may need an allow-list of valid domains, or something.

But, for the shorter term, we already have people deploying their own Network Policy (not admin, it wasn't ready then) to restrict our stuff since we have no policy today.

enhancements/network/core-network-policies.md Outdated Show resolved Hide resolved
enhancements/network/core-network-policies.md Outdated Show resolved Hide resolved
1. Change the namespace admission controller so all namespaces with an
openshift- prefix (since those are special anyway) are labeled with
the “openshift-namespace’ label so **that network policy can address
them**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always forget what the rule about OLM operators and openshift- is, so maybe you should explain that somewhere. (In particular, it seems from some of the other comments that OLM operators can make openshift- namespaces, but then how does that work with "(Not) Adding policies for operators not created by Red Hat"?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule is horrible.

They get installed into a default openshift-operators namespace, but the OLM will need to own the policy for that namespace. I suspect it will need to be wildcard allow egress and ingress.

But, worse, the admin can chose any namespace to add it to... there are ongoing conversations about how to deal with this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Former OLM team member here (recently moved to control plane group) 👋

OLM installed operators probably won't be able to specify NetworkPolicy resources in their bundles according to https://olm.operatorframework.io/docs/tasks/creating-operator-manifests/#packaging-additional-objects-alongside-an-operator

OLMv1 is supposed to GA in 4.18 and will, initially, respect a subset of the existing bundle format. The desire is to eventually support bundling formats that allows operator authors more control as to the contents packaged within their bundle (OLMv1 is going to be unopinionated in what is in the bundles) meaning operator authors will eventually be able to specify NetworkPolicy resources in their bundles.

With the current bundling format it seems like OLM would need to be updated to either:

  1. Allow NetworkPolicy resources in the bundle alongside the CSV. IIUC, to achieve compliance with the high-level goal of this proposal this would likely mean updating all bundles in the catalogs to contain a sufficient NetworkPolicy resource. Some open questions for this approach:
    • Do we have enough control of the catalog curation pipelines to ensure this is done properly?
    • Should a wildcard NetworkPolicy be added to all bundles by default if the authors of the operator can't/won't add one themselves?
  2. Stamp out a permissive NetworkPolicy resource on each operator installation.

With the focus on OLMv1 and some general attrition, there are not many folks on the OLM team left with knowledge of the legacy "OLMv0". Both of these are likely not trivial asks of the OLM team.

I do wonder if having CVO manage a default NetworkPolicy for the openshift-operators namespace, that is permissive enough for all operators installed in that namespace, is sufficient enough to achieve the compliance we are looking for even if admins can install an operator in any namespace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

Operators running in the openshift-operators namespace may need network access to just about anything, so I agree with @knobunc that OLM will likely end up needing to ship with an allow-all policy.

Beyond that, I am pretty leery of any solution that requires operators to change their bundle contents, as this will effectively block the use of older versions of $everything.

I'm not clear on whether there's a mechanism that allows the cluster admin to install/override/configure the NetworkPolicy in an impacted namespace. I see a knob to shut the whole thing down. And this section hints at it, but I can't tell what actor (cluster admin or developer) or environment (dev/test or production) it's talking about. Anyway, it seems like a per-namespace "escape hatch" would help mitigate this issue, as it would allow operators to publish documentation describing how to retrofit policies for legacy versions.

enhancements/network/core-network-policies.md Outdated Show resolved Hide resolved
connections that a namespace uses should make future development of
Admin Network Policy easier. We will be using Admin Network Policy
to allow cluster admins to override our namespace Network Policies
if needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is ANP a Non-Goal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use ANP as part of the solution, either created by the cluster admin to override, or by our operator with well-known labels that their namespaces can have applied. BUT I don't want to require the components to develop ANP. Partly because I can't work out how to get them added to a cluster... I want the components to manager their own policies and not have a central ANP repositiory. That way the operator can upgrade the policy as needed to ensure that they can seamlessly upgrade. The operator may need to install new policy before an upgrade and then remove the old policy once complete.

enhancements/network/core-network-policies.md Outdated Show resolved Hide resolved
enhancements/network/core-network-policies.md Outdated Show resolved Hide resolved

We also need to provide this information to the support organization
in a must-gather. We will augment the must-gather tool to include
information about network policy blocks in OpenShift namespaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that sufficiently-broken policies could interfere with both the Network Observability tool and must-gather.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we can catch at least some of that in automated testing. I am assuming that these are the RH shipped policies, not random ones that a customer applied to us.


Since nothing needs to use the labels immediately, there is no concern about ordering and racing, as long as the change happens during the upgrade.

At which point, the cluster administrator can use the new labels to easily apply default policies to all of the OpenShift namespaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do the new policies get created? How do you ensure there won't be problems when some namespaces have policies and some don't, and some have labels and some don't?

In my earlier proposal (#613) I suggested we'd have to roll this out over several releases to avoid problems...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely take a while to roll out since each team will need to add policy to their namespaces. And I suspect they will need assistance developing the policy, which is where ACS may be able to help for static analysis of objects and dynamic analysis of trafic and it will suggest policies based on that.

…nents

This enhancement outlines how we will add network policies to all core
components in OpenShift.  It also outlines how it will eventually
become enforcing, and how we will test compliance.
First round of updates in response to review comments, and some other
clean-up.
@knobunc knobunc force-pushed the core-network-policies branch from 0244f94 to 25b3070 Compare December 3, 2024 18:57
Copy link
Contributor

openshift-ci bot commented Dec 3, 2024

@knobunc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/markdownlint 25b3070 link true /test markdownlint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 1, 2025

## Proposal

1. Change the namespace admission controller so all namespaces with an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new, openshift/kubernetes admission plugin is probably a simpler implementation. This will not work on HCP.

Use ValidatingAdmissionPolicy to prevent the removal of the label on openshift-* namespaces.

Comment on lines +110 to +112
3. Work out how we can identify connections blocked by policy in
OpenShift namespaces (either egress or ingress) and work out how to
include it in a must-gather
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical feature for development and field debugging. Very glad to see it here.

Comment on lines +168 to +169
4. If it does not start with `openshift-` then any
`security.openshift.io/openshift-namespace` label will be stripped out and can not be set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mutate the input, but use ValidatingAdmissionPolicy to cause creation to fail to prevent this in newer clusters. Having the cluster-network-operator detect a violation of this rule and prevent upgrade until the cluster-admin corrects the problem is a good idea.

3. Otherwise it will strip the `security.openshift.io/openshift-namespace` label

TODO: Decide if this is the correct behavior... do we want to allow
namespaces to opt-in to being part of the platform. I do not think
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our platform namespaces are openshift-* namespaces. That's how things like monitoring recognize them. To be a part of the platform requires using that prefix.

1. Document what traffic flows (ingress and egress) need to be
allowed for pods in the namespaces

2. The ACS product can be used to analyze workloads in a cluster to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't deploy ACS in CI. When ACS isn't installed, what options are available?

trying to talk to whom and mint tight network policies to reflect
that model.

4. The Network Observability tool can be used to detect when a network policy blocked traffic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Network observability isn't available by default. Without network observability, how will we identify and resolve issues?

Comment on lines +280 to +282
Are there any unique considerations for making this change work with Hypershift?

No.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised there is no impact to having the kube-apiserver run outside the cluster and reach into the cluster to call things like admission webhooks. How about a specific admission webhook and CRD conversion webhook test to confirm that there is no HCP impact?

Comment on lines +299 to +302
How does this proposal affect MicroShift? For example, if the proposal
adds configuration options through API resources, should any of those
behaviors also be exposed to MicroShift admins through the
configuration file for MicroShift?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Microshift doesn't run the cluster-network-operator. How will upgrades of microshift get the appropriate namespace labels? How will upgrades prevent non-openshift-* namespaces from squatting on the label?

Copy link
Contributor

openshift-ci bot commented Jan 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from knobunc. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants