
According to this article, ZFS is sub-optimal for databases due to fragmentation of the db files. Could you please comment? Thx. http://bartsjerps.wordpress.com/2013/02/26/zfs-ora-database-...

-----


Oracle advice on how to configure ZFS and Oracle Database

http://www.oracle.com/technetwork/server-storage/solaris/con...

Some highlights:

Free space is important: keep 10-20% free.

>In an environment with a high rate of data update (or churn), it is advisable to maintain a certain amount of free space. ZFS is a copy-on-write file system and relocates all writes to free disk space. Keeping a certain amount of free blocks allows ZFS to easily find space in big chunks, allows ZFS to aggregate writes and reduce the write IOPS demand on hard disks. Streaming aggregated writes can be 50-100 times less demanding on hard drives than doing small random writes. Therefore, the incentive for keeping free space is high, especially in a high churn environment, such as an active OLTP database.

>The number one rule for setting up an Oracle database on ZFS is to set ZFS recordsize equal to the database block size for the file systems that contain the Oracle data files. recordsize = db_block_size

-----


Here's my take on a solution to the problem of linkrot: https://www.purplerails.com/

Major points:

* Automatically saves an exact copy of pages (no need to explicitly bookmark) in the background.

* Data is encrypted with your password before being sync'd to cloud.

* Search through your pages.

* Works as a Chrome browser extension. No need for a native app.

-----


Hi! Same thing here as with the other examples mentioned in this thread: this only helps you.

If you save a page but someone else needs it, they're out of luck.

But if, in addition to making you a private, encrypted archive, they also tested whether the URL was publicly visible and, if so, made a WARC of it, then they could package up all those WARCs for donation to the Internet Archive, and everyone could benefit.

-----


> If you save a page but someone else needs it, they're out of luck.

There is a sharing feature to solve this problem. :)

But I agree with your point.

I actually looked into WARC earlier but didn't have the bandwidth to do it in my first version. When I implement the ability to download your data, I'll try hard to use WARC. Unless there's some brain damage in the format: I hope not! :)

-----


You have to save the WARC-required stuff on the initial capture, because it's a dump of the client/server conversation as well as the content. But thanks for thinking about it!

Here are some previous comments with links that might be useful:

https://news.ycombinator.com/item?id=6506032

https://news.ycombinator.com/item?id=6671152

-----


In Firefox 3, the default value for the lifetime of entries in the browser history was changed from 9 days to 99 days. In subsequent releases, it was changed to "indefinite, or whatever's reasonable, after applying some heuristics for the machine we're on".

A while back, I imagined bringing page content itself—and not just choice metadata like its URL and title—into the purview of the browser's history backend, too, effectively enabling WYGOIWYGO (What You Got Online Is What You Get Offline).

(I started off, funnily enough, not trying to imagine the next logical step in the "moar history" march, but instead with the Internet Archive in mind. I was trying to think of a way that would give ordinary plebs a zero-effort way to add to the Wayback Machine actual archived content in the same way that Alexa and the Internet Archive were slurping in data from the Alexa Toolbar about what pages are getting hits out there.)

After the stuff that happened last January with Aaron Swartz, I was even motivated to write up some use cases and gave it the codename "Permafrost":

Ashley just wants all the content he bookmarks (or simply accesses) to be always available to him, without being frustrated months or years from now by 404s, service shutdowns, and web spiders stretched too thin, allowing his favored content to slip through the cracks of their archiving efforts.

- < https://wiki.mozilla.org/Permafrost >

Even so, it remains one of those projects that I should really get around to kicking off someday, but may never end up starting, much less get close to "completing".

-----


If I need to rely on your private domain to access my own research, how is it any different or less risky than Diigo etc.? I'm waiting for an extension that lets me keep my own full activity data locally (and, optionally, use your cloud in addition).

-----


Understood. Thx for the comment. The ability to download your data in a well-documented format (possibly WARC) is coming soon. I hope you will try out PurpleRails in the meanwhile. Thx again!

-----


I've been thinking about this for a while now. Please check out my web app to solve this problem: https://www.purplerails.com/

The main idea is to use a browser extension to automatically save pages that you read to the cloud (including the images, stylesheets etc) in the background. Saved pages are searchable and sharable.

-----


This sounded really great until I went to the website and saw that I can't use my own cloud storage, only purplerails'. As soon as purplerails disappears all my saved pages are gone. I already have this functionality with diigo and it makes me very uncomfortable not to have a copy of the data.

-----


Excellent point. The ability to download your data in a well-documented format is coming soon. See also my reply to hollerith on a native client.

Time limitations are what's preventing me from doing this.

Thanks for your feedback! Hope you will use Purplerails. :)

-----


An early design idea I had for Pinboard was as a browser plugin that just saved everything it saw in passing to an upstream server. But the problem that stumped me was that there's much more downstream bandwidth than upstream on a typical residential connection, so it was hard to push things to a server in anything like real time. How did you end up dealing with this issue?

-----


Pages are saved in the background. Nothing too fancy. I dedup when uploading: that helps a lot.

-----


Okay, but how do you handle things like big PDFs or image gallery sites? Or pages that just pull in a lot of javascript includes? That stuff downloads in parallel, but then I would find myself trying to push it upstream through a little straw of bandwidth, sequentially.

-----


You're right, it takes longer to upload a page than to download it. And image-gallery-like pages take a while (I know because I save Imgur pages now and then :)). But in practice, this isn't a problem.

There are a couple of heuristics to avoid wastefully uploading pages: the full page is uploaded only if the reader expresses "sufficient interest" in it. Currently the heuristic is 90 seconds of continuous reading of a page, or scrolling to the bottom. If a page is read for at least 10 continuous seconds, only the text of the page is uploaded.
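
Roughly, the decision looks something like this (a simplified sketch, not the actual extension code; the names are made up for illustration):

  // Thresholds match the numbers above; everything else is hypothetical.
  const FULL_SAVE_SECONDS = 90;   // continuous reading (or scroll to bottom) => full page
  const TEXT_SAVE_SECONDS = 10;   // continuous reading => text only

  function decideUpload(secondsRead, scrolledToBottom) {
    if (secondsRead >= FULL_SAVE_SECONDS || scrolledToBottom) {
      return 'full-page';   // page plus images, stylesheets, etc.
    }
    if (secondsRead >= TEXT_SAVE_SECONDS) {
      return 'text-only';
    }
    return 'skip';          // not enough interest; upload nothing
  }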

Static assets like JS files benefit from deduping: they take time to upload the first time, but subsequently processing them is much faster.

Typically, people read multiple pages in a browsing session: I rapidly open many tabs and then read each one for multiple seconds. There's a debug mode in Purple Rails in which a timer counts up when I switch to a tab. I find that I typically spend 100+ seconds (usually much more) on a page that I read through to the end. This is usually enough time on a residential broadband connection (I have Sonic DSL) to finish uploading a page. I also use Purple Rails on a tethered 4G connection almost every day: uploading is slower than DSL, but it works.

Basically, by the time you finish reading a tab, the previous tab you read would have finished saving.

Like I said, nothing too fancy.

-----


Sweet, thanks for the detailed answer! I look forward to checking it out.

-----


How can you dedupe if the content is encrypted?

-----


Deduping is per-user, not across users. For the sort of content here, this works well: e.g., static assets of web pages are the ones that get dedup'ed.

The basic algo is to generate an HMAC of the plaintext and compare it against a table of previously uploaded blobs' HMACs. The HMAC is keyed with a key derived from the user's password. When a blob is uploaded, the ciphertext and the HMAC of the plaintext are both uploaded.
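
A simplified sketch of that flow in Node-style JavaScript (the server.hasBlob/server.putBlob calls and the parameter choices are hypothetical, just for illustration):

  const crypto = require('crypto');

  // Dedup tag: HMAC of the plaintext, keyed with a key derived from the
  // user's password, so the server can't recompute tags for known content.
  function dedupTag(plaintext, password, salt) {
    const macKey = crypto.pbkdf2Sync(password, salt, 100000, 32, 'sha256');
    return crypto.createHmac('sha256', macKey).update(plaintext).digest('hex');
  }

  async function maybeUpload(plaintext, ciphertext, password, salt, server) {
    const tag = dedupTag(plaintext, password, salt);
    if (await server.hasBlob(tag)) {
      return;                               // already stored for this user; skip
    }
    await server.putBlob(tag, ciphertext);  // upload ciphertext keyed by the HMAC
  }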

-----


I'd prefer for the pages to be saved to the hard drive of the machine running the web browser.

But maybe browser extensions cannot obtain permission to do that?

-----


I see your point. I may release installable native apps that can serve as the local storage backend for the truly paranoid. Time limitations etc: the usual. :)

I anticipate that PurpleRails will be used on multiple computers and over several years, which is why pages are sync'd to the cloud.

I've adopted the current architecture because I feel that the energy barrier that needs to be overcome to persuade somebody to install an app is much higher than the one to install an extension.

-----


I tried giving purplerails a shot. I use LastPass for password management, and I am not going to type in a 25-30 character password every single time I want to log into an application. I think you're going way too hard on that part. This is the first time this has happened to me when using a web app, and it immediately made me close the page.

-----


Thanks for taking the time to write your feedback. I appreciate it.

With PurpleRails, you will rarely have to type your password once you successfully log in. It works kinda like Gmail/Facebook etc.: you remain logged in for months at a time.

This page explains why password saving doesn't work yet: https://www.purplerails.com/blog/saving-passwords-why-we-hav...

I'll see if I can do something fancier than what I do now to allow saved passwords to work. I suspect that by using a JavaScript-based AJAX authentication system instead of a plain old HTML form, I can achieve the privacy goals as well as ease of use.

I hope you will hang in there and use the current version until I figure out a way to fix this. Thanks!

-----


Thanks for responding. Another 'issue' I've noticed is that it logs me out if I don't have the extension installed and click on the installation button, which, as you might figure, is a pain to deal with when you have a 25-character password. I guess you're saving the web page data from the client side and not through your servers.

Another thing I've noticed is that when I add a page, it takes quite some time to show up in the web app. Also, what's with the timer?

Apart from these issues, I really believe in your idea, and I have been working on making such a system for myself for months. I wish you the best, and I hope you succeed. I'll update you if I find any new issues; if they all get fixed, I can see myself using this as my primary bookmarking service.

-----


Thanks for the report!

The logout-when-the-extension-isn't-installed behavior is related to the same reasons why saved passwords don't work yet.

The timer is supposed to be a debug feature that I thought was off by default! :) It shows the amount of time you've spent in that tab. I'll turn it off by default in the next update. For now, go into the extension options and uncheck "show page view timer" (near the bottom of the options page).

wrt taking quite some time: the first time you save a page from a site, it's likely to take some time, since things like static assets will also be uploaded. Subsequent saves should be much faster due to dedup'ing which works well on static assets. Let me know if this is not the case.

-----


I see it also saves content from emails and all that. That certainly justifies the need for extensive security. What are your plans for pricing? I suggest you don't take the route that Pocket did: $5 for anything and everything (which seems to be the norm these days) is ridiculous.

-----


You can go into the options and disable your email host. It's not the easiest interface :) but the format is hopefully obvious: it's a JSON array; if any of the strings appears as a substring in the page URL, that page is excluded from text-only or full-page saving.

Copy and paste the same thing into "Autoindex exclude rules" and "Autosave exclude rules".
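
For example (these hostnames are just illustrative), both fields could contain something like:

  ["mail.google.com", "outlook.office.com", "mail.yahoo.com"]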

Could you expand on your feedback on pricing? Can't tell if you're saying $5 is too high/too low. If you wish you can also email me (I couldn't find your email on your site).

-----


That wasn't too hard, but obviously a non-programmer would have difficulty understanding it. But hopefully (and quite possibly) you're working on a better interface, so it's not much of a problem for the time being.

I've got another piece of feedback. The numbers seem to be off for me in the web app. It says I have over 300 pages bookmarked, while the interface only displays 100 (which seems like the more plausible number).

$5 is kind of a norm these days, and a lot of the time it's too high, depending on what the service is. For instance, there's IRCCloud. Obviously, the interface is good, and they provide easy-to-use mobile apps, but all that doesn't justify $60 per year. Another example is Pocket. I really like their apps, but they also followed the $5 norm. They aren't providing me enough value to be worth $5 (every month), so I switched over to Pinboard (which, by the way, is one of those few services doing pricing right these days). There are a lot of services I can justify paying $5 per month for, cloud storage for instance, but a bookmarking service, nope.

If you want to see the kind of backlash Pocket got for pricing their Premium option at $5, just have a look at this thread on reddit:

http://www.reddit.com/r/Android/comments/26qaif/pocket_intro...

-----


The 300 number is probably correct. The UI shows the most-recent 100 pages by default. I'll be adding 'next' and 'previous' page links soon. I mostly use the search function since I have many pages saved, so it hasn't been a high priority.

You can list all the pages you've currently saved by adding '?n=10000' to the list page URL. Let me know if the 300 number is incorrect.

Thx for the feedback on pricing. Will take it into account.

-----


Sounds like Evernote; if I'm mistaken, please enlighten me :)

-----


1) saving is automatic

2) privacy-first architecture (e.g., plaintext is never uploaded, plaintext URLs are never uploaded etc)

-----


ah, so it saves... everything? wow! that's nuts. I mean, where the heck are you going to store all of that?

and it is terribly inefficient to store say, 30,000 examples of the exact same article. or do you have a way to check and not store duplicates? if so, what if a blog post is saved today and has 10 comments and it is saved tomorrow by another person and it has 11 comments.

technically the page is different so it would be saved again

I think you need to explain exactly how it works a bit better or maybe I'm just not getting it

:)

-----


> ah, so it saves... everything? wow! that's nuts. I mean, where the heck are you going to store all of that?

Please see my reply to idlewords. It isn't literally everything; there are some heuristics to detect what was interesting to you.

I understand you might find this excessive. I routinely find it useful. :) See also "As We May Think" by Vannevar Bush.

> and it is terribly inefficient to store say, 30,000 examples of the exact same article ...

You're right: no deduping is or can be done across users (HMACs keyed with the user's password are used to create the dedup hashes). Storage is sufficiently cheap that I feel it's an acceptable tradeoff for privacy (i.e., the server being unable to confirm that two users have saved the same page).

-----


Please also look at https://www.purplerails.com/

I tried to address the privacy angle from day 1. Data is encrypted and only ciphertext is sent to the cloud. The index is stored only on your own machine. Searching occurs on your own machine.

One cool feature is that it also saves an exact HTML copy of the page, including images and stuff, if you read a page "long enough" (currently 90 seconds).

Been in beta for a while. Thanks for feedback.

-----


Shameless plug: https://www.purplerails.com/

(1) Saves an exact copy of the page also.

(2) Indexes the text.

(3) Encrypted (search occurs on your computer).

Been in beta for a while. Thanks for feedback.

-----


Love the cartoon on your front page!

-----


Thank you for your kind words!

-----


Really nice utility. More developers should dip their toes into crypto and develop applications like this. :)

A comment in the code about why it's OK in your case to use the same key for MAC and encryption would be useful. I think you're fine. See here: http://security.stackexchange.com/questions/37880/why-cant-i...

I needed to implement deduplication in my system. Since I controlled the server, I developed a slightly more elaborate system which doesn't have the limitation of a predictable IV (predictable from the encryption key).

So in my system, I derive two keys from the same passphrase (PBKDF2 with different salts). I encrypt as usual with unpredictable IVs. When uploading, the HMAC of the plaintext and the SHA-256 of the ciphertext are both uploaded.

To check for duplication, the client asks if a certain HMAC is already present. And it's an error (at the server) to upload multiple ciphertexts with the same HMAC.
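
In Node-flavored code, the scheme looks roughly like this (salts, iteration counts, and names are simplified for illustration, not taken from a real implementation):

  const crypto = require('crypto');

  // Two independent keys from one passphrase: PBKDF2 with different salts.
  // In practice the salts would be per-user values, not string constants.
  function deriveKeys(passphrase) {
    const encKey = crypto.pbkdf2Sync(passphrase, 'encryption-salt', 100000, 32, 'sha256');
    const macKey = crypto.pbkdf2Sync(passphrase, 'dedup-hmac-salt', 100000, 32, 'sha256');
    return { encKey: encKey, macKey: macKey };
  }

  function prepareBlob(plaintext, keys) {
    const iv = crypto.randomBytes(16);      // unpredictable IV, fresh per blob
    const cipher = crypto.createCipheriv('aes-256-cbc', keys.encKey, iv);
    const ciphertext = Buffer.concat([iv, cipher.update(plaintext), cipher.final()]);
    return {
      // What gets uploaded: HMAC of the plaintext (for the dedup check)
      // plus SHA-256 of the ciphertext (identifies the stored blob).
      dedupHmac: crypto.createHmac('sha256', keys.macKey).update(plaintext).digest('hex'),
      ciphertextHash: crypto.createHash('sha256').update(ciphertext).digest('hex'),
      ciphertext: ciphertext
    };
  }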

-----


The vulnerability in that post is for using AES-CBC with AES-CBC-MAC with the same key. I'm using AES-CBC and HMAC-SHA512, which should be okay. The design was reviewed by cryptographers and was given a green light, plus I tried to use as little custom crypto as possible for this exact reason :)

-----


I believe one additional way to mitigate Risk #2 "Broken Authentication and Session Management" should become best practice:

The ability to sign out of all other sessions.

Without this, a user who forgot to log out at a library would be out of luck until the session expired.
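
One common way to implement this (a sketch, not tied to any particular framework; the names are illustrative) is to keep a per-user session generation number, stamp it into each session at login, and bump it when the user asks to sign out everywhere else:

  const users = new Map();    // userId -> { sessionGeneration: 0 }

  function signOutOtherSessions(userId, currentSession) {
    const user = users.get(userId);
    user.sessionGeneration += 1;                          // invalidates older sessions
    currentSession.generation = user.sessionGeneration;   // keep the current one alive
  }

  function isSessionValid(userId, session) {
    const user = users.get(userId);
    return !!user && session.generation === user.sessionGeneration;
  }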

-----


Ditto. I'm waiting for the Haswell machines to come out. I'm considering the ThinkPad T440 or T440s (whichever falls within my budget). They've been previewed, but aren't available yet. No pricing info. http://shop.lenovo.com/us/en/laptops/thinkpad/t-series/t440/

-----


>Configure your T440 with an HD+ (1600 x 900 resolution) LCD display with high brightness and enjoy a premium visual experience. For enhanced navigation configure your T440 with 10-point multitouch.

That resolution is a dealbreaker.

-----


There is also a 1080p screen option coming in the next few months.

-----


They haven't even put an indicative weight on the web page yet...

-----


Quick question: why do you need to normalize the origin? Is it ever passed in by the browser runtime un-normalized?

I.e., why couldn't you have written:

  if (e.origin !== 'https://clef.io') {

-----


Yep, that code was actually legacy code that ended up causing problems. I stripped it out :D, but only after I'd spent a few hours messing around with this.

-----


Thanks. I was not normalizing, and wanted to make sure it was OK.

-----

