Completed 10 years ago (Nov 3, 2006, 4:05:30 PM)
Although connected mode is where it all started, Coda having its roots in AFS2 and all, everything in Coda's design inherently favors disconnected and write-disconnected operation: the lack of locking, and the fact that we notify clients of an update only after it has happened. It is simply not possible to force an operation by retrying it indefinitely, because we do not have "last writer wins" semantics, but "last writer gets a conflict". Applications that absolutely rely on connected mode consistency and semantics are doomed to fail as soon as someone trips over a network cable, or when the client decides it is better to back off and work (write-)disconnected for a while as a result of server load, network congestion, etc.
Instead of treating logging and reintegration as an occasionally used fallback for 'normal' operation, I'd rather have it as the main focus of the Coda File System and make it the best it can possibly be. There are definite advantages to always using write-back logging:
Changes are logged and written back to the servers asynchronously, minimizing the time we block the application. We also send updates in batches, and the server can now commit up to 100 operations in a single transaction, which is considerably more efficient. Finally, clients optimize the logs to remove operations that would cancel each other out, for instance creation/removal of temporary files, multiple writes to the same file, etc.
Connected mode and write-disconnected mode are very different in behaviour (and implementation). By not having the hard switchover point the story becomes a lot simpler. There are now just 2 numbers:
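To illustrate the log optimization idea, here is a minimal sketch in Python. It is not the venus implementation; the `LogRecord` type, the three operation names and `optimize_log` are made up for this example, and the only assumption is that a later operation can make earlier pending records for the same file unnecessary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    op: str    # hypothetical operation names: "create", "store", "remove"
    path: str


def optimize_log(log):
    """Return a copy of the log without records that cancel each other out."""
    result = []
    for rec in log:
        if rec.op == "store":
            # A newer store supersedes an older pending store of the same file.
            result = [r for r in result
                      if not (r.op == "store" and r.path == rec.path)]
            result.append(rec)
        elif rec.op == "remove":
            creates = [i for i, r in enumerate(result)
                       if r.op == "create" and r.path == rec.path]
            if creates:
                # The file was created in this log: the create, its stores
                # and this remove never have to reach the server.
                start = creates[-1]
                result = result[:start] + [r for r in result[start:]
                                           if r.path != rec.path]
            else:
                # The file exists on the server: pending stores are moot,
                # but the remove itself still has to be sent.
                result = [r for r in result
                          if not (r.op == "store" and r.path == rec.path)]
                result.append(rec)
        else:
            result.append(rec)
    return result
```

With a log that creates, writes and removes a temporary file while a document is saved twice, only the last store of the document is left to send.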
- How long until changes are eligible for writeback.
- How much time are we allowed to spend on writing back changes.
(the second number defines how long the volume is locked and in a way how much of the available bandwidth is used for write purposes)
About once every 5 seconds all volumes are checked and any pending changes eligible for writeback are pushed back to the servers. If there is too much queued, we simply continue with the rest the next time we check the writeback logs. By default, the values are set to 0 and 1.0 seconds respectively. This combines reasonable consistency, as everything is immediately eligible for reintegration, with smooth adaptation when we have limited bandwidth, as we only use about 1/5th of the available time/bandwidth for writeback purposes.
Deep down these numbers have always existed for write-disconnected volumes, but they defaulted to 30 seconds for aging and 60 seconds for the writeback period. They were also not stored persistently, so any switch to or from another connectivity mode would override the preferences the user had set.
Not having separate connected and log-based write paths also resolves an issue that was hard to solve with the old clients. A connected store that completed successfully on the server, but where the client was disconnected before it received the final reply, would get logged as a pending update by the client. When the logged store was reintegrated, the server would flag it as an update/update conflict. When we always log, the operation only ever reaches the server through reintegration, and reintegration knows how to correctly detect and handle retries after a disconnection.
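How the two numbers interact can be sketched as follows. This is an illustration only, not code from venus: the function name, the `(logged_at, cost)` record layout and the simulated per-record cost are assumptions made for the example. The 5 second check interval and the defaults of 0 and 1.0 seconds come from the description above.

```python
from collections import deque

CHECK_INTERVAL = 5.0  # seconds between checks of the writeback logs


def writeback_pass(log, now, age=0.0, time_budget=1.0):
    """Send eligible records, oldest first, until the time budget is used up.

    `log` is a deque of (logged_at, cost) tuples, where `cost` is the
    simulated number of seconds it takes to write that record back.
    Returns the list of records that were sent in this pass.
    """
    sent = []
    spent = 0.0
    while log and spent < time_budget:
        logged_at, cost = log[0]
        if now - logged_at < age:
            break  # the oldest record is still too young, so all others are too
        sent.append(log.popleft())
        spent += cost
    return sent


# Five queued changes that each take half a second to write back.
queue = deque((0.0, 0.5) for _ in range(5))
first = writeback_pass(queue, now=CHECK_INTERVAL)
second = writeback_pass(queue, now=2 * CHECK_INTERVAL)
third = writeback_pass(queue, now=3 * CHECK_INTERVAL)
```

In this run the three passes send 2, 2 and 1 records: whatever does not fit in the budget simply waits for the next check, and with the defaults at most about `time_budget / CHECK_INTERVAL` (one fifth) of the time goes to writeback.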
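The difference between a retry and a real conflict can be shown with a toy model. This is not how the Coda server is implemented; the `ToyServer` class, the per-operation identifier and the integer versions are all assumptions for the sake of the example. It only shows the general idea that a server which remembers which logged operations it has already applied can acknowledge a retry instead of flagging it as a conflict.

```python
class ToyServer:
    """Toy model of a server that recognizes retried log records."""

    def __init__(self):
        self.version = {}   # path -> current version number
        self.contents = {}  # path -> file contents
        self.applied = {}   # operation id -> version produced by that operation

    def reintegrate(self, op_id, path, base_version, data):
        if op_id in self.applied:
            # Same logged operation seen again: the reply to the first
            # attempt was lost, so just repeat the earlier answer.
            return ("ok", self.applied[op_id])
        current = self.version.get(path, 0)
        if current != base_version:
            # Someone else changed the file since the client last saw it.
            return ("conflict", current)
        self.version[path] = current + 1
        self.contents[path] = data
        self.applied[op_id] = current + 1
        return ("ok", current + 1)


server = ToyServer()
first_try = server.reintegrate("client1-op1", "notes.txt", 0, "hello")
# The reply is lost in a disconnection, so the client sends the record again.
retry = server.reintegrate("client1-op1", "notes.txt", 0, "hello")
# A different operation based on the same stale version is a real conflict.
other = server.reintegrate("client2-op1", "notes.txt", 0, "bye")
```

Without the operation identifier the retry would look exactly like the third call, which is the update/update conflict the old connected store path ran into.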
Using only write-disconnection for over a year has really forced me to focus on the reliability of reintegration and resolution. Most of the server-side improvements in recent releases are a direct result of my need for reliable write-disconnected operation.
The combined patches remove a little over 5000 lines from the Coda client, which is almost 15% of the code in coda-src/venus. At some later point, when we remove the connected operations from the server, we can drop an additional 1300 lines. Fewer lines of code means less code to maintain (and ideally fewer bugs).
Of course there are drawbacks. All of these actually exist with the existing clients as well; they just become harder to avoid and more visible:
Local changes hide global updates
When we create a new file in a directory, we cannot refresh the locally cached copy until all changes have been pushed to the server. This is not as noticeable when there is no sharing between clients or when changes are reintegrated quickly. However, not being able to refresh dirty cached data leads to,
More complex conflicts
There used to be 2 types of conflict, server-server (the result of failed resolution), and local-global (the result of failed reintegration). Only occasionally did reintegration fail because of a server-server inconsistency, probably because most people tried to keep their clients in connected mode, in which case the operation would simply fail instead of getting preserved in the log. But now that all modifications are logged and reintegrated, when we fail the log record stays around as a conflict.
This is good in one way, because we won't lose conflicting updates. On the other hand, these conflicts were impossible to repair because both types of conflict are expanded and handled differently. We used to bring in a second client to first repair the underlying server-server inconsistency before we could try to fix the reintegration conflict.
The new client code actually tries to unify the way a conflict is expanded, and to some extent attempts to repair all conflicts. However, repair is still an area that is not completely sorted out, and although server-server conflict repair works as well as before, the local-global and local-server-server cases still require more work. Also, the reintegration and resolution improvements on the server have made many of the unnecessary conflicts that plagued reintegration a thing of the past.
Cooperation between multiple authors
Because updates are logged and written back asynchronously, working together on the same set of files requires some user action to make sure the updates are propagated. Either use 'cfs fr' (forcereintegrate) to push pending updates to the servers at a synchronization point, or set the reintegration age to a low value.
There is also a special setting which makes reintegration occur synchronously. When both reintegration age and time are set to 0, we reintegrate all pending changes before venus returns to the application. Behaviour-wise this comes really close to the old 'connected mode operation' combined with 'cfs strong', and in fact using 'cfs strong' will place the specified volume in this synchronous reintegration mode.
Although I do not believe it fundamentally changes anything for the end-user, it definitely changes some of the thinking about how the system works, and probably in a good way: if you assume your writes are always logged and delayed, you won't get a nasty surprise when Coda decides it may be better to disconnect for a while.
One question was what version number to use for this release. I think it is 95% there, but it probably isn't ready for a major release. On the other hand it is more of a milestone than a 6.1.3 release would indicate. So I bumped the version to 6.9.0, and at some point it will become the future Coda-7.0.