Hello everybody... I have been struggling for over a week now to get my cross-namespace HTTPRoutes working.
I have the following setup (the very same as the official examples):
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mgmt-internal-gw
  namespace: gateway
  annotations:
    cert-manager.io/issuer: cf-issuer
spec:
  gatewayClassName: gke-l7-rilb
  listeners:
  - name: http-listener
    port: 80
    protocol: HTTP
    hostname: "my.domain.com"
  - name: https-listener
    port: 443
    protocol: HTTPS
    hostname: "my.domain.com"
    allowedRoutes:
      namespaces:
        from: All
    tls:
      mode: Terminate
      certificateRefs:
      - name: internal-wildcard-cert
        kind: Secret
        group: ""
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-filter-redirect
  namespace: gateway
spec:
  parentRefs:
  - name: mgmt-internal-gw
    sectionName: http-listener
  hostnames:
  - "my.domain.com"
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https
        statusCode: 301
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: argocd-route
  namespace: system-argocd
spec:
  parentRefs:
  - name: mgmt-internal-gw
    namespace: gateway
    sectionName: https-listener
  hostnames:
  - "my.domain.com"
  rules:
  - matches:
    - path:
        value: /
        type: PathPrefix
    backendRefs:
    - name: argocd-argo-cd-server
      port: 8080
I can see all the routes and the Gateway itself healthy in the GCP console, but when I try to reach my domain I get a 503 "no healthy upstream". I can confirm that the service behind it (ArgoCD) is working fine: I can access it with a port-forward or from another pod via curl. If I deploy ArgoCD (or any other service) in the same namespace, so that the HTTPRoute, Gateway, and Service all live together, everything works well. I have also tried creating ReferenceGrants, but the result is the same. I am using Autopilot and have tried GKE versions 1.27, 1.28, and 1.29. The cluster is fully private, if that matters...
Do you have any suggestions about what could be missing in my setup?
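For completeness, a ReferenceGrant for a setup like this would look roughly as follows (a sketch; the object name is illustrative, and note that a ReferenceGrant governs cross-namespace route-to-backend references, while route-to-Gateway attachment is controlled by the listener's allowedRoutes):

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-ns-routes  # illustrative name
  namespace: system-argocd       # namespace of the referenced Services
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: gateway           # namespace where the referencing routes live
  to:
  - group: ""
    kind: Service
```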
Hi @adimitrov,
Welcome to the Google Cloud Community!
It may be challenging to troubleshoot this issue without more visibility, but if I understand correctly, you are receiving 503 "no healthy upstream" errors on your GKE Gateway.
HTTP 503 Service Unavailable with "no healthy upstream" means that either there are no hosts available to serve the traffic, or all hosts have failed the backend health checks.
The HealthCheckPolicy and GCPBackendPolicy resources must exist in the same namespace as the target Service or ServiceImport resource. You might want to check out this documentation.
Add the health check configuration to your YAML file, as this ensures that traffic is only routed to backends that are able to respond.
Here's a sample YAML from the documentation provided:
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: lb-healthcheck
  namespace: lb-service-namespace
spec:
  default:
    checkIntervalSec: INTERVAL
    timeoutSec: TIMEOUT
    healthyThreshold: HEALTHY_THRESHOLD
    unhealthyThreshold: UNHEALTHY_THRESHOLD
    logConfig:
      enabled: ENABLED
    config:
      type: PROTOCOL
      httpHealthCheck:
        portSpecification: PORT_SPECIFICATION
        port: PORT
        portName: PORT_NAME
        host: HOST
        requestPath: REQUEST_PATH
        response: RESPONSE
        proxyHeader: PROXY_HEADER
      httpsHealthCheck:
        portSpecification: PORT_SPECIFICATION
        port: PORT
        portName: PORT_NAME
        host: HOST
        requestPath: REQUEST_PATH
        response: RESPONSE
        proxyHeader: PROXY_HEADER
      grpcHealthCheck:
        grpcServiceName: GRPC_SERVICE_NAME
        portSpecification: PORT_SPECIFICATION
        port: PORT
        portName: PORT_NAME
      http2HealthCheck:
        portSpecification: PORT_SPECIFICATION
        port: PORT
        portName: PORT_NAME
        host: HOST
        requestPath: REQUEST_PATH
        response: RESPONSE
        proxyHeader: PROXY_HEADER
  targetRef:
    group: ""
    kind: Service
    name: lb-service
Make sure you review the restrictions and limitations before deploying GKE resources. You might also want to consider filing a ticket with our Google Support team, as they are well equipped to handle issues like these.
Hope you find this helpful.
Hello RFelizardo,
I do appreciate your help here 🙂 Per your suggestion and everything described in the documentation, I have also created the HealthCheckPolicy, GCPBackendPolicy, and GCPGatewayPolicy. Here are the specs:
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: lb-healthcheck
  namespace: system-argocd
spec:
  default:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 3
    unhealthyThreshold: 3
    logConfig:
      enabled: true
    config:
      type: HTTPS
      # httpHealthCheck:
      #   port: 8080
      #   requestPath: /
      httpsHealthCheck:
        port: 8443
        requestPath: /
  targetRef:
    group: ""
    kind: Service
    name: argocd-argo-cd-server
---
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
  namespace: system-argocd
spec:
  default:
    sessionAffinity:
      type: CLIENT_IP
  targetRef:
    group: ""
    kind: Service
    name: argocd-argo-cd-server
---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: my-gateway-policy
  namespace: default
spec:
  default:
    allowGlobalAccess: true
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: mgmt-internal-gw
Unfortunately, the behavior is still the same: I still get a 503...
I do not think it is a problem of insufficient nodes because, as I mentioned, if I do a port-forward or move the Service and Pods into the same namespace as the Gateway, everything works fine.
One interesting thing: I set up vanilla Kubernetes on Compute Engine VMs and tested the cross-namespace scenario there, and it works. So I guess it is a GKE thing... It is getting really frustrating, as GKE promises to be the easiest and slickest way to consume Kubernetes, however...
I am going to try support; fingers crossed the solution comes from their end.
Hello all, I am passing by to tell you that I found the solution myself. The solution is not described in any documentation (I don't know why...). Can someone here tell me how to give GCP feedback so the docs can be updated?
The solution:
- The exposed services must be of type NodePort, not ClusterIP as in the documentation; otherwise, the backend health checks do not pass. There is not even a meaningful alert in the GCP UI when browsing the Gateway or the routes. You need to go to the services, click the service you want to expose, go to its backends, expand the advanced options, and only then will you see a small alert that the backend cannot reach the service... Really annoying.
- The HealthCheckPolicy, GCPBackendPolicy, and GCPGatewayPolicy are NOT required (though they are of course best practice).
- The ReferenceGrant is not needed at all.
Good luck everyone!
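To apply the NodePort fix above, the backing Service just needs its type changed — a minimal sketch, assuming the Service name and port from this thread; the selector labels are illustrative and must match your actual ArgoCD deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-argo-cd-server
  namespace: system-argocd
spec:
  # NodePort instead of ClusterIP so the load balancer's
  # backend health checks can reach the service
  type: NodePort
  selector:
    app.kubernetes.io/name: argo-cd-server  # illustrative; use your chart's labels
  ports:
  - name: http
    port: 8080
    targetPort: 8080
```

For an existing Service, `kubectl patch svc argocd-argo-cd-server -n system-argocd -p '{"spec":{"type":"NodePort"}}'` makes the same change in place.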