On zsync vs. rsync+inotify: synchronization algorithms and applicable scenarios (advanced discussion, proceed with care)
Date: 2010-03-17
Source: the Internet
This post was last edited by shinelian at 2010-03-18 00:07
Quote: http://everythinglinux.org/rsync/
rsync
Diffs - Only actual changed pieces of files are transferred, rather than the whole file. This makes updates faster, especially over slower links like modems. FTP would transfer the entire file, even if only one byte changed.
Quote: http://zsync.moria.org.uk/index
zsync is a file transfer program. It allows you to download a file from a remote server, where you have a copy of an older version of the file on your computer already. zsync downloads only the new parts of the file. It uses the same algorithm as rsync. However, where rsync is designed for synchronising data from one computer to another within an organisation, zsync is designed for file distribution, with one file on a server to be distributed to thousands of downloaders. zsync requires no special server software, just a web server to host the files, and imposes no extra load on the server, making it ideal for large scale file distribution.
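To make the quoted description concrete, typical client-side usage is a single command (a minimal sketch; the URL is a made-up example, and if a local ./ubuntu.iso already exists zsync reuses it as the seed and fetches only the changed blocks):
zsync http://releases.example.org/ubuntu.iso.zsync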
zsync also has other features such as rsync over HTTP and handling of compressed files; see the documentation for details.
Question 1: What are the appropriate use cases for rsync and zsync respectively, why is that so, and if you pick one of them, how do you tune it?
rsync has a few options that may be relevant here (a rough usage sketch follows this list):
-W, --whole-file        copy whole files, no incremental checks (may help in understanding this case)
--no-whole-file         turn off --whole-file
-S, --sparse            handle sparse files efficiently (how do you read this option, and where does the efficiency gain actually show up?)
-c, --checksum          always checksum (how does this affect the behaviour and the efficiency?)
--partial               keep partially transferred files (an interesting option as well)
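The sketch below shows how these options might be combined; it is my own illustration, and the paths and the host name "backup" are placeholders.
rsync -av --no-whole-file --partial /data/src/ /data/dst/   # force the delta algorithm even for a local copy (rsync defaults to --whole-file locally) and keep partially transferred files
rsync -avS /data/images/ backup:/data/images/               # -S preserves holes in sparse files such as VM disk images instead of writing them out as zero blocks
rsync -avc /data/src/ backup:/data/dst/                     # -c decides whether a file has changed by full checksum rather than size and mtime, at the cost of reading every file on both sides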
I have read some of the official documentation and drawn a few conclusions; I am not sure whether they are correct.
For example, if 5000 rsync clients simultaneously rsync a 1 GB file in which only 50 bytes have changed, the rsync server ends up computing the checksums 5000 times over, which will overload the rsyncd server, even though each client only needs to transfer about 50 bytes of changed data.
Likewise, if 5000 zsync clients simultaneously zsync the same 1 GB file with only 50 bytes changed, the difference information is computed in advance and only the differing parts are transferred, so the HTTP server behind zsync is not overloaded: zsync adds an extra metadata file that already contains the information needed to work out the differences, and each zsync client again transfers only about 50 bytes of changed data.
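A rough sketch of the workflow I have in mind for that zsync scenario (the host name dl.example.com is a placeholder; see the man pages linked below for the exact options):
zsyncmake -u http://dl.example.com/big.img big.img     # run once on the server whenever big.img changes; writes big.img.zsync next to it
zsync -i big.img http://dl.example.com/big.img.zsync   # run on each of the 5000 clients, reusing the local old copy as the seed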
I would appreciate pointers from anyone with hands-on experience.
Appendix: some simple zsyncmake usage examples:
zsyncmake -C -u http://ftp.uk.debian.org/debian/dists/sarge/main/binary-i386/Packages.gz Packages.gz
Note use of -C to save the client compressing the file on receipt; the Debian package system uses the file uncompressed.
zsyncmake -z my-subversion-dump
In this case there is a large, compressible file to transfer. This creates a gzipped version of the file (optimised for zsync), and a .zsync file. A URL is automatically added assuming that the two files will be served from the same directory on the web server.
zsyncmake -e -u http://www.mirrorservice.org/sites/ftp.freebsd.org/pub/FreeBSD/ports/distfiles/zsync-0.2.2.tar.gz zsync-0.2.2.tar.gz
This creates a zsync referring to the named source tarball, which the client should download from the given URL. This example is for downloading a source tarball for a FreeBSD port, hence -e is specified so the client will be able to match its md5sum.
zsyncmake man
http://www.helplinux.cn/man/1/zsyncmake.html
zsync man
http://www.helplinux.cn/man/1/zsync.html
References: I had translated several paragraphs, but they were lost when saving failed. A pity.
Having read the overview paragraphs below, they should confirm that the conclusions above are correct.
HTTP already provides the Range header for transferring partial content of files. This is useful only if you are able to determine from some other source of information which are the changed sections. If you know that a file is a log and will only ever grow — existing content will not change — then Range is an effective tool. But it does not solve the problem by itself.
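As a concrete illustration of the Range mechanism (the URL is hypothetical and the byte offsets arbitrary):
curl -r 1048576-2097151 -o chunk.bin http://example.com/big.iso   # request only bytes 1 MiB through 2 MiB - 1 of the file
curl -C - -o app.log http://example.com/app.log                   # resume an append-only file such as a log from where the local copy ends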
There are alternative download technologies like BitTorrent, which break up the desired file into blocks, and retrieve these blocks from a range of sources [BitT2003]. As BitTorrent provides checksums on fragments of file content, these could be used to identify content that is already known to the client (and it is used for this, to resume partial downloads, I believe). But reusing data from older files is not a purpose of this data in BitTorrent — only if exactly matching blocks could be identified would the data be any use.
The best existing solution from the point of view of minimising data transfer is rsync. rsync uses a rolling checksum algorithm that allows the checksum over a given block length at all points in a file to be calculated efficiently. Generally speaking, a checksum would have to be run at every possible start point to achieve this — the algorithm used in rsync (see [Rsync1998]) allows the checksum window to be rolled forward over the file and the checksum for each new location to be trivially derived from the previous checksum and the values at the window edges. So rsync can calculate the checksum at all points in the input file by streaming through the file data just once. While doing so it compares each calculated checksum against the list of checksums for the existing data file, and spots any chunks from the old data file which can be reused.
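A quick note on why the rolling step is cheap (my own simplified illustration, not the exact rsync formula): if the weak checksum of an L-byte window starting at offset i were simply the sum of the bytes in that window, then
sum(i+1) = sum(i) - data[i] + data[i+L]
so sliding the window by one byte costs two additions instead of re-summing L bytes. rsync's real weak checksum adds a second, position-weighted term, but that term can be rolled forward in the same O(1) fashion.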
So rsync achieves a high level of data reuse. It comes at a high computational cost, however. The current rsync implementation calculates the checksums for a set of blocks on the client, then uploads these to the server; the server then uses the rsync algorithm to work out which blocks the client has and which it needs, and pushes back the blocks it needs. But this approach suffers many drawbacks:
- The server must receive and act on a large volume of data from the client, storing it in memory, parsing data, etc — so there is the opportunity for denial of service attacks and security holes. In practice rsync has had a remarkably good security record: there have been a few vulnerabilities in the past few years (although at least one of these was actually a zlib bug, if I remember rightly).
- The server must reparse the data each time. It cannot save the computed checksums. This is because the client sends just the checksums for disjoint blocks of data from its pool of known data. The server must calculate the checksum at all offsets, not just at the block boundaries. The client cannot send the checksum at all points, because this would be four times larger than the data file itself — and the server does not want to pre-compute the checksums at all points, because again it would be four times larger, and require four times as much disk activity, as reading the original data file. So CPU requirements on the server are high. Also the server must read the entire file, even if the final answer is that the client requires only a small fragment updated.
- Memory requirements for the server are high - it must store a hash table or equivalent structure of all the checksums received from the client while parsing its own data.
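To put rough numbers on the 1 GB example above (my own back-of-the-envelope estimate, assuming a 2 KB block size and about 20 bytes of checksum data per block, i.e. a 4-byte rolling checksum plus a 16-byte strong checksum): 1 GB / 2 KB is roughly 524,288 blocks, so each client uploads about 10 MB of checksums that the server has to hold in memory while it re-reads the full 1 GB file; with 5000 concurrent clients that is on the order of 50 GB of uploaded checksum data and 5000 complete passes over the file.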
The drawbacks with rsync have prevented it being deployed widely to distribute files to the general public. Instead, it has been used in areas closer to the existing use of cvs and sup, where a limited community of users use an rsync server to pull daily software snapshots. rsync is also very widely used inside organisations for efficient transfer of files between private systems, using rcp or scp as a tunnel. rsync also has very powerful functionality parallelling cp -a and tar's abilities, with transfers of file permissions, directory trees, special files, etc. But public releases are rarely made with rsync, as far as I can tell.
I should also mention rproxy. While I have not used it myself, it is an attempt to integrate the rsync algorithm into the HTTP protocol [RProxy]. An rproxy-enabled client transmits the rsync checksums of blocks of data it already has to the server as part of the HTTP request; the server calculates the rolling checksum over the page it would have transmitted, and transmits only the blocks and the meta-information needed for the client to construct the full page. It has the advantage of integrating with the existing protocol and working even for dynamic pages. But it will, I suppose, suffer the same disk and CPU load problems as rsync on large files, and is an unwelcome overhead on the server even for small files. Since server administrators are rarely as concerned about bandwidth and download time as the client, it is hard to see them wanting to put extra work on their servers by offering either rsync or rproxy generally.
CVS and subversion provide specialised server programs and protocols for calculating diffs on a per-client basis. They have the advantage of efficiency once again, by constructing exactly the diff the client needs — but lose on complexity, because the server must calculate on a per-client basis, and the relatively complicated server processing of client requests increases the risk of security vulnerabilities. CVS is also poor at handling binary data, although subversion does do better in this area. But one would hardly distribute ISO images over either of these systems.
Hybrid protocols have been designed, which incorporate ideas from several of the systems above. For instance, CVSup [CVSup1999] uses CVS and deltas for version-controlled files, and the rsync algorithm for files outside of version control. While it offers significantly better performance than either rsync or CVS, due to efficient pipelining of requests for multiple files, it does not fundamentally improve on either, so the discussion above — in particular the specialised server and high server processing cost per client — apply.
Tentative conclusions:
1. With rsync+inotify, once the number of clients grows, the server's CPU and memory cannot keep up (see the sketch right below for the setup I mean).
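For reference, by rsync+inotify I mean the usual inotifywait push loop from inotify-tools, roughly as follows (a sketch only; the watched path and the destination hosts web1/web2 are placeholders):
inotifywait -m -r -e modify,create,delete,move /data | while read dir events file; do
    # every filesystem event triggers a full delta-transfer pass to every destination,
    # which is where the master's CPU and memory blow up as the destination count grows
    for host in web1 web2; do rsync -az --delete /data/ "$host":/data/; done
done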
Third-party reference:
http://www.gaojinbo.com/rsync%E7%9A%84%E5%87%A0%E7%A7%8D%E4%BC%98%E5%8C%96%E5%BA%94%E7%94%A8%E6%96%B9%E6%A1%88.html (Gao's article on rsync optimisation schemes is genuinely written from hands-on experience.)
Author: shinelian    Posted: 2010-03-17
Never used zsync myself, heh.
Author: badb0y    Posted: 2010-03-17
Marking this thread; will come back and read it in detail.
Author: william0427    Posted: 2010-03-18