Now and again in system administration you come across a problem that at first seems intractable and often you end up finding a work around and have to accept you’ll never fully find out what the root cause was. Then sometimes you find some little chink in to the problem and with the right tools you can get to the bottom of it.
Recently I’ve had just such a problem and Python has been the tool that helped me get to the bottom of it.
I spend a lot of time dealing with various aspects of FTP and recently I started seeing odd file transfer failures and scripts hanging with open FTP sessions. The main FTP server transfers about 3000 files a day so the occasional hiccough is perhaps inevitable but this was an unusually high level of failures.
I looked at the error messages I was seeing in the logs of various transfers an they showed a server response saying:-
500 Command Not Understood ‘PORT 10,225,1,89,124,22_’
Now I can see why a server would complain about a port command that said that. For those of you who are not as intimately familiar with the RFC959 for the FTP protocol as I have unfortunately had to become firstly I’d recommend you read this excellent site that explains the key parts of the protocol and especially the difference between active and passive transfer modes.
However all you really need to know is that the PORT command is used to tell a server what port it can talk to the client on. The first 4 numbers (octets) are a standard 4 dot quad IP address and the last two octets are supposed to be the port number, so octet 5 x 256 + octet 6 gives you the number in decimal.
For example ‘PORT 10,225,61,50,127,200’ means I’m on IP 10.225.61.50 talk to me on port 127 x 256 + 200 = 32512 + 200 = 32712
So you can see why the server would say ‘PORT 10,225,1,89,124,22_’ is not understood. What exactly is ’22_’ supposed to be?
Ok that was what was causing the FTP transfers to fail and/or scripts to hang but what was causing the problem? As ever you ask the question “what has changed?” As it turns out I knew our DMZ firewall had been updated so I spoke to the network security team and of course they asked for more detail to help them track things down and could I reproduce it so they could check the logs.
Well long story short I started doing lots of semi-automatic shell scripted transfers and discovered that when the last octet of the port command was 227 it would cause this error. It seemed bizarre but that is what the problem seemed to be. The only snag was I had to do a couple of hundred transfers from a client to hit a matching port and if something else on the box nicked the port that ended in 227 for some other purpose I would have to cycle round another 256 odd transfers.
What I needed was a simple way to establish an FTP connection and then specify a port that had the last octet in 227. I’d used the FTP library in Python to script transfers and sure enough this was also the way I could specify a port and have a quick a simple way to recreate the problem.
I fired up the Idle interactive environment and used the following few lines to recreate the problem:-
from ftplib import FTP
ftp = FTP(‘10.202.3.9’, ‘uuuuuuu’ , ‘pppppp’)
‘200 PORT Command successful’
error_perm: 500 Command not understood ‘PORT 10,225,51,60,125,22_’
Three quick lines and I could reproduce my error. By using this I could establish that only transfers that went via the DMZ firewall would have this problem and the security guys could turn on extra logging while I recreated the problem.
Thanks to Python I had a quick, simple and interactive way to test my hypothesis.
All I need now is for the security guys to work out what is causing this but at least they have a much better idea where to start than me saying “well about 1 transfer in 250 seems to fail.” Kudos to the Python.