Thursday, 30 July 2009

Samba, Squid and Active Directory authentication

This post marks the end of a few weeks of challenging debugging.

At $WORK we are implementing a new proxy server based on Squid. Unlike our old proxy, we want to authenticate each user against Active Directory. In order for this to work, Samba (or more specifically, the Winbind component of Samba) needs to be configured.

Getting Samba set up

Windows networking in the Active Directory world is built on DNS for name resolution, Kerberos for authentication, LDAP for directory services, and the SMB protocol for file, print and RPC.

The first step was to configure the proxy server to use the AD Domain Controllers for DNS resolution. This was done by editing /etc/resolv.conf and configuring /etc/nsswitch.conf.
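
For example (the DC addresses and search domain here are illustrative, not our real ones):

# /etc/resolv.conf
search css.ad.example.com
nameserver 10.1.0.10
nameserver 10.1.0.11

# /etc/nsswitch.conf (relevant line)
hosts: files dns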

The second step was to get Kerberos working. I've detailed this in another blog posting. This also needs to point to the AD Domain Controllers for the KDC.
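
The end result is that /etc/krb5.conf contains something like this (the DC hostname is illustrative):

[libdefaults]
    default_realm = CSS.AD.EXAMPLE.COM

[realms]
    CSS.AD.EXAMPLE.COM = {
        kdc = dc01.css.ad.example.com
        admin_server = dc01.css.ad.example.com
    }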

Configuring Samba was also fairly straightforward. There was no need to run smbd (used for file and print serving) or nmbd (the naming service) as this box would not be performing those roles. Only the winbindd daemon needs to be running; it is responsible for authenticating against the Active Directory.
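
On a Red Hat style system that means something like:

# chkconfig smb off
# chkconfig winbind on
# service winbind start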

The smb.conf can be small:

[global]
workgroup = CSS
realm = CSS.AD.EXAMPLE.COM
server string = Squid Proxy and Samba Server
security = ADS
log file = /var/log/samba/%m.log
max log size = 50
socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192
dns proxy = No
winbind:ignore domains = MAT LPS LAB MMSC GRP IMCR UPGRADE CENTRAL 4THFLOOR AD CSSDEV NAS
hosts allow = 10.1, 127.


To join the domain, run the net ads join command using the credentials of a domain administrator; something like this (the account name is illustrative):
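
# net ads join -U Administrator

You should be able to confirm whether the trust worked by typing: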

# wbinfo -t
checking the trust secret via RPC calls succeeded


If that's okay, try pulling in a list of users, again using wbinfo but this time using the -u flag:

# wbinfo -u

And this is where things started to go wrong for me. Sometimes it would work, but most of the time it would fail with an error. This took a lot of investigating and the details can be found here in the Samba mailing list archive. It was this bit that took most of the time to debug.

Getting Squid working

Having got wbinfo to reliably return the list of users, it was time to configure Squid to use ntlm_auth (which in turn uses winbindd to perform the authentication request). The /etc/squid/squid.conf needs the following:

auth_param ntlm program /usr/bin/ntlm_auth --helper-protocol=squid-2.5-ntlmssp
auth_param ntlm children 5
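
Before involving Squid at all, it's worth checking by hand (as root) that ntlm_auth can talk to winbindd; the username here is illustrative:

# /usr/bin/ntlm_auth --username=jbloggs
password: ********
NT_STATUS_OK: Success (0x0)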


At this point everything should work wonderfully...

Except Squid was unable to authenticate, complaining about the permissions on /var/lib/samba/winbindd_privileged. It appears that Winbind expects this directory to have 750 permissions with root:root ownership, while Squid runs as the "squid" user. According to one post I read about the issue, it may be caused by the way that Red Hat have built Squid. One possible workaround is to use an ACL (yes really, a use for ACLs in Unix!) but it appears my install doesn't have ACL support enabled(!).
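
For what it's worth, the ACL approach would look something like this, assuming the filesystem were mounted with the acl option:

# mount -o remount,acl /var
# setfacl -m u:squid:rx /var/lib/samba/winbindd_privileged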

So the immediate workaround for me is to create a script that basically does the following (see the sketch after the list):

  • Set permissions on /var/lib/samba/winbindd_privileged to 750
  • Start Winbind
  • Set permissions on /var/lib/samba/winbindd_privileged to 777 (yes, I know...)
  • Start Squid
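
A minimal sketch of that wrapper:

#!/bin/sh
# winbindd wants winbindd_privileged to be 0750 when it starts,
# but the "squid" user needs to get into the directory afterwards
PRIVDIR=/var/lib/samba/winbindd_privileged

chmod 750 $PRIVDIR
service winbind start
chmod 777 $PRIVDIR    # yes, I know...
service squid start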

This appears to work okay until I can find a real solution.

So although this is now working, it took longer than anticipated. The thing about Samba and Active Directory integration is that it's complicated, with so many options. The learning curve is steep, but I'm starting to feel I now have a grip on it.

Friday, 17 July 2009

Configuring Kerberos on CentOS 5

Kerberos is a ticket-oriented authentication system that was originally designed for Unix networks, but was also embraced (and extended) by Microsoft in Active Directory. I've been debugging a number of issues involving the Squid proxy server on Linux using Samba to authenticate against Active Directory, and as part of this I had to get familiar with Kerberos.

It's not trivial, so I've documented my workflow here. Hopefully it will be useful to others.

The test environment consists of two virtual machines running CentOS 5, imaginatively named centos01 (krbserver) and centos02 (krbclient). For the purpose of this test, centos01 is the Kerberos server and centos02 is the client.

I followed the instructions here and broadly recommend them. These are my additional notes to clarify some parts of the install.

General notes

Make sure that you use the same time source for both client and server. I used NTP to keep the two VMs in sync. The notes do state this but it's worth stressing.

Remember how many IT problems are caused by name resolution errors! Make sure you have both the server and client registered in DNS (or have entries in /etc/hosts). If using /etc/hosts, both the standalone hostname and the FQDN should be added:

192.168.192.26 krbserver.local.zone krbserver
192.168.192.108 krbclient.local.zone krbclient


Note the order of the hostname and the FQDN! This is important (see further below).

Configure the server

After installing the packages using YUM, configuring the database and ACL file, adding the first principal user and starting the three services, the server should be ready to go. Confirm this with kinit and klist.
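
For the record, those steps boil down to something like this on CentOS 5 (the kadm5.acl line here grants full rights to any */admin principal; adjust to taste):

# yum install krb5-server krb5-workstation
# kdb5_util create -s
# echo '*/admin@LOCAL.ZONE *' > /var/kerberos/krb5kdc/kadm5.acl
# kadmin.local -q "addprinc julian/admin"
# service krb5kdc start
# service kadmin start
# service krb524 start

Now it's time to configure the client.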

Configure the client

Install the packages using YUM and then run the kadmin command and add a new principal for the client machine. It's worth noting that this should be done using the kadmin interactive interface instead of trying to put the "addprinc" parameter on the command line. This is because the -randkey option will be interpreted by kadmin on the command line as "-r andkey" and it will try and authenticate against the "andkey" realm. So for me, the command looked like:

# kadmin -p julian/admin@LOCAL.ZONE
Password for julian/admin@LOCAL.ZONE: ********
kadmin: addprinc -randkey host/krbclient.local.zone


I assume that this is roughly analogous to adding a machine to an Active Directory domain.

Once this entry has been created, export the principal to the workstation's /etc/krb5.keytab file.
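
Still inside kadmin, the export looks like this (ktadd writes to /etc/krb5.keytab by default):

kadmin: ktadd host/krbclient.local.zone

You can check what landed in the keytab afterwards with klist -k.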

In addition to the machine principal, I also created a normal (non-admin) local user, julian@LOCAL.ZONE. On the client, I log in as my own non-root user ("julian") and type kinit:

$ kinit
Password for julian@LOCAL.ZONE: ********


If this succeeds, you should see that the "ticket granting ticket" has been assigned:

$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: julian@LOCAL.ZONE

Valid starting     Expires            Service principal
07/17/09 11:14:20  07/18/09 11:14:20  krbtgt/LOCAL.ZONE@LOCAL.ZONE

Kerberos 4 ticket cache: /tmp/tkt500
klist: You have no tickets cached


This process shows that communication between the client and server using Kerberos is successful.

Configuring telnet (for testing)

On the server, I then enabled the krb5-telnet service in /etc/xinetd.d and started xinetd; something like:
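
# chkconfig krb5-telnet on
# service xinetd start

On the client, I then ran: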

$ /usr/kerberos/bin/telnet -a krbserver
Trying 192.168.192.26...
Connected to krbserver.local.zone (192.168.192.26).
Escape character is '^]'.
[ Kerberos V5 refuses authentication because telnetd: krb5_rd_req failed: Key version number for principal in key table is incorrect ]
[ Kerberos V5 refuses authentication because telnetd: krb5_rd_req failed: Key version number for principal in key table is incorrect ]
Password:


Problem: it was asking for a password, which implied the Kerberos ticket was not being passed correctly. However, when I ran klist on the client, it showed that a service ticket for the host had been issued correctly:

$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: julian@LOCAL.ZONE

Valid starting     Expires            Service principal
07/17/09 10:38:20  07/18/09 10:38:20  krbtgt/LOCAL.ZONE@LOCAL.ZONE
07/17/09 10:38:31  07/18/09 10:38:20  host/krbserver.local.zone@LOCAL.ZONE


After running strace against the telnetd process, it appeared that the telnet server was failing when trying to read /etc/krb5.keytab. But all the documentation I had read stated that the keytab should be created on the client and not the server. So, why does the Kerberos server need a keytab file?

Answer: the Kerberos server does not require a keytab file, but the telnet server does! Although they are both running on the same VM, the telnet server is itself a client of the Kerberos server. Simple when you work it out; it would have been easier to understand if my telnet server had been on a different machine from the Kerberos server.

So I ran the kadmin command on the server and created a keytab file for the host principal using the ktadd command, which for me looked like this:
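
# kadmin -p julian/admin@LOCAL.ZONE
Password for julian/admin@LOCAL.ZONE: ********
kadmin: ktadd host/krbserver.local.zone

I restarted the Kerberos services for good measure, cleared my client and server caches using kdestroy, restarted xinetd and tried the telnet again: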

[julian@krbclient bin]$ ./telnet -a krbserver
Trying 192.168.192.26...
Connected to krbserver.local.zone (192.168.192.26).
Escape character is '^]'.

[ Kerberos V5 accepts you as ``julian@LOCAL.ZONE'' ]

Last login: Fri Jul 17 10:48:10 from krbclient
[julian@krbserver ~]$


Result!

Configuring SSH

The instructions state that GSSAPIAuthentication and GSSAPIDelegateCredentials need to be enabled. I did this and restarted the SSH daemon with -ddd (debug) enabled.
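
For reference, the relevant lines are:

# /etc/ssh/sshd_config (on the server)
GSSAPIAuthentication yes

# /etc/ssh/ssh_config or ~/.ssh/config (on the client)
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes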

The first attempt at running ssh krbserver prompted for a password, but the server debug revealed the following:

debug1: Unspecified GSS failure. Minor code may provide more information
No principal in keytab matches desired name


Okay, so this is weird. Checking the output of klist showed this:

[julian@krbclient ~]$ ssh krbserver
julian@krbserver's password:
Connection closed by 192.168.192.26
[julian@krbclient ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: julian@LOCAL.ZONE

Valid starting     Expires            Service principal
07/17/09 13:42:27  07/18/09 13:42:27  krbtgt/LOCAL.ZONE@LOCAL.ZONE
07/17/09 13:42:33  07/18/09 13:42:27  host/krbserver@

Kerberos 4 ticket cache: /tmp/tkt500
klist: You have no tickets cached


Note that krbserver@ has no realm. This turned out to be because /etc/hosts (on the client) looks like this:

192.168.192.108 krbclient krbclient.local.zone
192.168.192.26 krbserver krbserver.local.zone


Putting the hostname after the FQDN like this:

192.168.192.108 krbclient.local.zone krbclient
192.168.192.26 krbserver.local.zone krbserver


fixes the problem!

[julian@krbclient ~]$ ssh krbserver
Last login: Fri Jul 17 13:54:40 2009 from krbclient.local.zone


Klist now shows:

[julian@krbclient ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: julian@LOCAL.ZONE

Valid starting     Expires            Service principal
07/17/09 13:42:27  07/18/09 13:42:27  krbtgt/LOCAL.ZONE@LOCAL.ZONE
07/17/09 13:42:33  07/18/09 13:42:27  host/krbserver@
07/17/09 13:54:38  07/18/09 13:42:27  host/krbserver.local.zone@LOCAL.ZONE


Kerberos 4 ticket cache: /tmp/tkt500
klist: You have no tickets cached


Summary

What you see above does not include the time spent trying things out and staring blankly at the screen. Getting Kerberos up and running is not the most trivial process and while there is some decent documentation, there are also a lot of people posting questions and asking for help when it doesn't work properly. Hopefully this will shed some light on it for others.

Sunday, 12 July 2009

The future is... iSCSI?

I've recently completed the infrastructure requirements for the 2010 business plan at $WORK. As part of this I have specified a new fibre channel switch and some additional fibre attached storage.

Despite this, I'm starting to suspect that the future of SAN connectivity will be iSCSI over copper Ethernet.

Ethernet and IP technologies have basically beaten everything else out there and now dominate computer networks and IP telephony. Why have a different standard for storage networks? Consolidation of the fabric for LAN, SAN and VOIP seems logical to me. Share the components and reduce the total cost.

So why continue to specify fibre channel? We already have a large investment in fibre channel; it's a known quantity and is well supported. Sun Solaris has a very mature FC implementation, and as of VI3, VMware works best on FC (not sure yet if vSphere changes this).

It's also faster (for us). We currently have 2Gbit switches but will be adding 4Gbit switches later this year and early next. Sure, 10Gbit Ethernet is available, but it's still too expensive for us to deploy (especially when adding the cost of switches and NICs).

But fast forward three years and I would expect the following:

  • 10Gbit Ethernet switches at a reasonable price with 40Gbit or 100Gbit inter-switch links
  • 10Gbit NICs with TCP Offload Engine (TOE) as standard and cheap
  • iSCSI boot as standard on these 10Gbit NICs (some do already, but it's not guaranteed)
  • Better support in the hypervisor / operating system for iSCSI

At the end of the day, managing a single fabric is easier than juggling a bundle of different cable types, protocols, HBAs and drivers.

It's always risky in this business to speculate how things might look in 3 years. If you disagree, please let me know why; it's always good to get alternative views...

Friday, 10 July 2009

Sun StorageTek 2540 and ESX troubleshooting

We experienced a few issues with the StorageTek 2540 array that forms the core of our SAN recently. The symptom was that the array flagged itself as being in a degraded state and that one or more volumes were not assigned to the preferred controller.

The first step was to upgrade the SAN firmware and Common Array Manager (CAM) software to the latest release. Despite this, we observed the problem again. Further digging into the problem found that the failover was happening when we performed a LUN rescan under VMware ESX.

My previous understanding was that there were essentially two types of arrays: active/active and active/passive. In the active/active configuration, both controllers in an array can service I/O requests to a specific volume concurrently. In an active/passive configuration, one [active] controller handles the I/O with the second [passive] controller sitting idle, only servicing I/O if the active controller fails.

I understood the StorageTek 2540 to be an active/passive array; it is only possible to assign a volume to one controller at any time. However, in order to improve the throughput of the array, different volumes can be assigned to different controllers. For example, a volume “VOL1” might be assigned to controller A as its active controller and to controller B for its passive controller, while volume “VOL2” might be assigned to controller B as its active controller and controller A as its passive controller.

It turns out that things are more subtle than this; there is a third type of array configuration: asymmetric.

The asymmetric configuration follows the active/passive model in that only one controller is servicing I/O for a specific volume at any time, but extends this by allowing I/O operations to be received by the second controller. If this happens, the array will automatically failover the volume to the second controller to service the request. This process is called Automatic Volume Transfer (AVT). If the first controller then receives I/O operations, the AVT moves the volume back.

Yes, this could cause some flapping between controllers. It can also cause I/O stalls as the volumes fail over between controllers.

Some of the array initiator types (such as Solaris with Traffic Manager, aka MPxIO) disable AVT; others, including the Linux initiator that we've used on our VMware hosts, have AVT enabled.

So the problem we’re having appears to be caused by the array failing over a volume to its second controller. But why is it doing this? The only configuration I had performed on the ESX side was to ensure the multi-pathing option was set to Most Recently Used (MRU); the correct setting for active/passive arrays. What appears to have happened is that when booting, the ESX servers are not mapping to a consistent path. Out of our five ESX servers, three were setting one controller as active, while the other two servers were setting the second controller as active. Presumably, when one of the hosts (that has the wrong active path) performs a scan, the request is sent to the failover controller which invokes AVT and fails over the volume.

How to fix?

Sun have told me that the next version of CAM, due in a few weeks, will include a “VMware” initiator type which will disable AVT. This will negate the need to perform the NVSRAM hack in VMware’s Knowledge Base, but will require a firmware upgrade.

In the meantime, it might be a case of just ensuring that all the ESX hosts are using the same path to connect to each volume. This is all theory as I’m still working this out, but at least it’s all starting to make sense.
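
On each host, the currently active path can be audited from the service console; for example:

# esxcfg-mpath -l

This lists every LUN with its paths, policy and which path is active, so it's easy to compare across the five hosts.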

Although not specifically VMware or 2540 related, the following links provide some interesting reading around the subject:

  • Sun discussion forum thread about preferred and owner controllers
  • Linux kernel mailing list post detailing a bug experienced with multipath and asymmetric arrays

Saturday, 4 July 2009

Note-taking on the Cloud

In the "old" days, things were pretty simple; the Palm handled contacts, calendar, tasks and notes well, allowing me to carry everything with me but still sync them with my PC.

Of course, things are more advanced these days: Multiple computers, at home and at work, the iPhone providing near continuous Internet connectivity, web based services and richer software applications. But despite this, I'm still struggling to get perfect syncing across all platforms. Here's where things are for me today:

  • Contacts: Google Contacts synced with the iPhone, but no Outlook/Exchange integration.
  • Calendar: Google Calendar synced with the iPhone, but primary calendar only.
  • Tasks: Still using Outlook/Exchange for this. No sync.
  • Notes: Some notes in Outlook, some notes in OneNote, a few notes on the iPhone. Nothing syncs.

The last one is particularly disappointing as note synchronisation shouldn't be difficult. I tried using Google Notebook for a while until Google got bored and dropped it.

It was then I tried Evernote. There is a free version, provided you don't exceed a certain monthly usage allowance, and there are clients for Windows, Mac OS X and the iPhone. There's also a web interface for when I'm in Linux or on a public machine.

My preference in terms of note-taking functionality and power is OneNote. The only downside to this application is that notes are basically locked to the client PC, or synced to the corporate SharePoint server at best.

In comparison with OneNote, Evernote has fewer features and a less rich interface. The Mac and Windows versions have different levels of functionality (the Mac has a nicer set of views, IMHO). But in its favour, any notes I make, on any of my devices, now sync with the cloud.

Evernote is therefore my de facto notes application. For now.