In short: a Python script that lets you download a single file from a ZIP archive placed on an HTTP server that supports sending partial content. See the comment in the code for more info :)
Please note that there is practically no error checking.
Also, on Python 2.5 and older this code is vulnerable to path traversal when saving the target file, so it's best to use it only on trusted archives (or with Python 2.6+).
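If you want a safety net that does not depend on the interpreter version, you can check the member name yourself before extracting. Below is a minimal sketch of such a check, written for Python 3 rather than the script's Python 2. The helper name `is_safe_member` is made up for this example and is not part of zipfile or of the script, and the rules are deliberately strict (any name with a colon in the second position is rejected as a possible drive prefix).

```python
import posixpath


def is_safe_member(name):
    """Return True if a ZIP member name stays inside the extraction directory."""
    # ZIP names use forward slashes; treat backslashes the same way to be strict.
    normalized = posixpath.normpath(name.replace("\\", "/"))
    # Absolute paths and anything that climbs above the target directory.
    if normalized.startswith("/"):
        return False
    if normalized == ".." or normalized.startswith("../"):
        return False
    # Possible Windows drive prefix such as "C:".
    if len(normalized) > 1 and normalized[1] == ":":
        return False
    return True


for candidate in ["readme_EndFirst.txt", "docs/readme.txt", "../../etc/passwd"]:
    print(candidate, is_safe_member(candidate))
```

The idea would be to call this on the requested file name and refuse to extract when it returns False.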
You can set DEBUG to True to see how many requests actually flow. It seems to be a sane amount: normally one read per header, one (or more?) for the variable-length fields, and another one for the data (so it took 3 reads to list the file in the example below, and another 3 to get the data).
I guess this could be optimized in zipfile a little more (e.g. all the central directory entries could be read in one shot, since the size of the central directory is stored in the end-of-central-directory record; also, the file data could be read together with the file name / extra headers), but this only makes sense if each "read" is really slow.
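To make the "one shot" idea concrete, here is a rough sketch (Python 3, not the script's Python 2) of parsing the fixed 22-byte part of the end-of-central-directory record to learn where the central directory starts and how long it is. The function name `locate_central_directory` is made up for this example, the archive is built in memory instead of being fetched over HTTP, and ZIP64 archives are not handled.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4s4H2LH"  # fixed 22-byte part of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)


def locate_central_directory(tail):
    """Return (size, offset, entry_count) parsed from the archive's tail."""
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_SIZE:
        raise ValueError("end-of-central-directory record not found")
    fields = struct.unpack(EOCD_FORMAT, tail[pos:pos + EOCD_SIZE])
    # fields: signature, disk, cd disk, entries here, entries total,
    #         central directory size, central directory offset, comment length
    entry_count, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return cd_size, cd_offset, entry_count


# Build a small archive in memory to stand in for the remote file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first file")
    archive.writestr("b.txt", "second file")
blob = buffer.getvalue()

# One read for the tail (record plus the largest possible comment),
# then one read for the whole central directory.
tail = blob[-(EOCD_SIZE + 65535):]
cd_size, cd_offset, entry_count = locate_central_directory(tail)
central_directory = blob[cd_offset:cd_offset + cd_size]
print(entry_count, cd_size, cd_offset)
```

With a remote archive the two slices of `blob` would become two range requests, no matter how many entries the archive has.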
Same goes for my code - there is no need to disconnect/reconnect each time.
Well, but this was supposed to be a small experiment anyway :)
Some output / example of usage:
14:44:47 gynvael> python zipdl.py http://gynvael.vexillium.org/dump/example.zip
File Name Modified Size
readme_EndFirst.txt 2013-05-13 14:30:34 231
14:44:55 gynvael> python zipdl.py http://gynvael.vexillium.org/dump/example.zip readme_EndFirst.txt
14:45:39 gynvael> ls -la readme_EndFirst.txt
-rw-r----- 1 gynvael gynvael 231 May 13 14:45 readme_EndFirst.txt
14:45:42 gynvael>
And the code itself:
#!/usr/bin/python
# A small PoC of making an HTTP-backed file-like object. In this case it's
# used by the zipfile library, so you can basically list all the files in
# a ZIP archive that's placed on a server that supports partial downloads.
# You can also download just a single specific file from that archive.
# This might be useful for huge archives where you need only a couple of
# smaller files :)
# Consider this public domain, no magic is here.
# Initially written by gynvael.coldwind//vx (2013)
import zipfile
import os
import sys
import httplib
import urlparse

DEBUG = False


def HTTPGetFileSize(url):
    # Ask for the headers only; Content-Length is the size of the archive.
    u = urlparse.urlsplit(url)
    conn = httplib.HTTPConnection(u.netloc)
    path = u.path
    if len(u.query) > 0:
        path += "?" + u.query
    conn.request("HEAD", path)
    res = conn.getresponse()
    if res.status != 200:
        print res.status, res.reason
        conn.close()
        return None
    data = res.getheader("Content-Length")
    conn.close()
    if data is None:
        return None
    return int(data)


def HTTPGetPartialData(url, f, t):
    # Fetch bytes f..t (inclusive) using a Range request.
    u = urlparse.urlsplit(url)
    conn = httplib.HTTPConnection(u.netloc)
    path = u.path
    if len(u.query) > 0:
        path += "?" + u.query
    conn.request("GET", path, "", {
        "Range": "bytes=%u-%u" % (f, t)
    })
    res = conn.getresponse()
    if res.status not in [200, 206]:
        print res.status, res.reason
        conn.close()
        return False
    data = res.read()
    conn.close()
    if res.status == 200:
        # The server ignored the Range header and sent the whole file.
        data = data[f:t + 1]
    return data


class MyFileWrapper:
    def __init__(self, url):
        self.url = url
        self.position = 0
        self.total_size = HTTPGetFileSize(url)
        if self.total_size is None:
            raise Exception("file not found or something like that")

    def seek(self, offset, whence=0):
        if whence == 0:
            self.position = offset
        elif whence == 1:
            self.position += offset
        elif whence == 2:
            self.position = self.total_size + offset
        if DEBUG:
            print "seek: (%d) %d -> %d" % (whence, offset, self.position)

    def tell(self):
        if DEBUG:
            print "tell: -> %d" % self.position
        return self.position

    def read(self, amount=-1):
        # Never ask for bytes past the end of the file.
        left = self.total_size - self.position
        if amount < 0 or amount > left:
            amount = left
        if amount <= 0:
            return ""
        d = HTTPGetPartialData(self.url, self.position, self.position + amount - 1)
        if d is False:
            raise IOError("partial download failed")
        self.position += len(d)
        if DEBUG:
            print "read: %d %d -> %d" % (self.position - len(d), amount, self.position)
        return d


# Let's start the code.
if len(sys.argv) not in [2, 3]:
    print "usage: zipdl.py <URL-to-zip> [<filename-to-extract>]"
    sys.exit(1)

f = MyFileWrapper(sys.argv[1])
z = zipfile.ZipFile(f, "r")
if len(sys.argv) == 2:
    z.printdir()
else:
    # Note, running this on Python 2.5 is shooting yourself in the foot
    # since there are no anti-path-traversal measures in <2.6.
    z.extract(sys.argv[2])
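For anyone on Python 3, where httplib and urlparse no longer exist under those names, the same idea might look roughly like the sketch below. This is an untested-against-real-servers port, not the original script: `RangeReader`, `http_size` and `http_fetcher` are names invented for this example. The reader takes a `fetch(first, last)` callable, so the HTTP part can be swapped out; the demonstration at the bottom slices an in-memory archive instead of talking to a server, and the `requests` counter stands in for the DEBUG output.

```python
import io
import urllib.request
import zipfile


def http_size(url):
    """Return the Content-Length reported for url by a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return int(response.headers["Content-Length"])


def http_fetcher(url):
    """Return fetch(first, last) that downloads an inclusive byte range."""
    def fetch(first, last):
        headers = {"Range": "bytes=%d-%d" % (first, last)}
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request) as response:
            if response.status != 206:
                raise OSError("server did not honour the Range header")
            return response.read()
    return fetch


class RangeReader:
    """Read-only file-like object backed by a fetch(first, last) callable."""

    def __init__(self, size, fetch):
        self.size = size
        self.fetch = fetch
        self.position = 0
        self.requests = 0  # number of fetch calls made so far

    def seekable(self):
        return True

    def tell(self):
        return self.position

    def seek(self, offset, whence=io.SEEK_SET):
        bases = {io.SEEK_SET: 0, io.SEEK_CUR: self.position, io.SEEK_END: self.size}
        self.position = max(0, bases[whence] + offset)
        return self.position

    def read(self, amount=-1):
        # Clamp to what is left so no request goes past the end of the file.
        left = self.size - self.position
        if amount is None or amount < 0 or amount > left:
            amount = left
        if amount <= 0:
            return b""
        data = self.fetch(self.position, self.position + amount - 1)
        self.requests += 1
        self.position += len(data)
        return data


# Offline demonstration: slice an in-memory archive instead of using HTTP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "hello from the archive")
blob = buffer.getvalue()

reader = RangeReader(len(blob), lambda first, last: blob[first:last + 1])
remote = zipfile.ZipFile(reader)
names = remote.namelist()
content = remote.read("readme.txt")
print(names, content, reader.requests)
```

Against a real server the reader would be created as `RangeReader(http_size(url), http_fetcher(url))`, assuming the server reports Content-Length and answers range requests with 206.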