You've seen the pictures of the NSA's new data storage facility in Utah, and maybe you've seen Hollywood's depictions of huge data repositories, so it would be easy to get the idea that a billion phone calls per day would take up a full floor of one of those fictional vaults. The truth is that phone call files aren't that big. None of these files are. In fact, you could configure a home computer (the old way, chaining drives on ribbon cables with master/slave jumpers) to hold 60 terabytes of data (15 drives x 4 TB = 60 TB). Not that you would, but you could. Of course, the bandwidth needed to take it all in is another story, and it's a long one on the flip side.
The Size of the Data Doesn't Matter As Much as You'd Think
Chances are you think storing a billion phone calls a day, every day, for years is unsustainable. It's a stretch, but it isn't fantasy. Ask your company's IT guy/gal if they have done any data storage calculations; they will hedge and add large margins for error, maybe doubling or tripling their estimates. So run the numbers yourself. A one-hour call stored as a 256 kbps MP3 is about 115 megabytes, so a billion of them would come to roughly 115 petabytes per day, about 42 exabytes per year. That's enormous, but it's the absolute worst case. The average phone call runs closer to two minutes than an hour, and telephone audio compresses down to speech-codec rates of around 13 kbps, which brings a billion calls a day down to a couple hundred terabytes per day, around 70 petabytes per year. That's a warehouse of Seagate drives rather than a shelf of them, but it's data-center territory, not science fiction. And the documentation released so far shows a collection of metadata, which is thousands of times smaller still. If full audio is even arguable, metadata is definitely doable.
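The arithmetic is easy to sanity-check in a few lines of Python; the bitrates and call lengths below are illustrative assumptions on my part, not measured values:

```python
# Back-of-envelope storage math for one billion phone calls per day.
# Bitrates and call lengths are illustrative assumptions.

def audio_bytes(seconds: float, bitrate_kbps: float) -> float:
    """Bytes needed to store `seconds` of audio at `bitrate_kbps`."""
    return seconds * bitrate_kbps * 1000 / 8

CALLS_PER_DAY = 1_000_000_000
PB = 1000**5  # one petabyte, decimal

# Worst case: every call lasts a full hour, stored as a 256 kbps MP3.
worst = audio_bytes(3600, 256) * CALLS_PER_DAY
# Realistic case: two-minute average call at a ~13 kbps speech-codec rate.
realistic = audio_bytes(120, 13) * CALLS_PER_DAY

print(f"worst case:     {worst / PB:.1f} PB/day")
print(f"realistic case: {realistic / PB:.3f} PB/day")
```

The worst case works out to about 115 PB/day; the realistic case to roughly 0.2 PB/day. Metadata records of a few hundred bytes per call would be thousands of times smaller than either.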
If you have trouble wrapping your mind around how to store a billion phone calls a day, the problem has little to do with raw hard drive space.
Tweets are limited to 140 characters, roughly 140 bytes in plain ASCII. Text messages, depending on your cell phone carrier, are limited to 160 characters (about 160 bytes). These are tiny files. You could store 10 billion tweets and 10 billion text messages a day and the text itself would add up to only about 3 terabytes per day, on the order of a petabyte per year. Twitter pics are just links to pictures stored elsewhere on the web, and a link costs about 20 characters of your 140-character limit.
If you have trouble conceiving why you'd want to store all these tweets and text messages, I concur; but the least of your worries is whether your computer system has enough disk space to do so. At that rate you'd fill one of these Western Digital drives every day or two, and a rack of them per year is trivial for a data center.
Email is the big number. It's estimated that in the U.S. alone, about 144.8 billion emails ricochet across the country every day. That includes all those blanket emails your boss sends you and every other member of your team, fifteen times a day; all those forwards and those annoying reply-alls; all those attachments we bounce back and forth. The average size of an email is estimated at 75 kilobytes. Say we decide that 75 KB is way off the mark and bump it up to 200 kilobytes: 200 KB times 144.8 billion emails per day is about 29 petabytes per day, a little over 10 exabytes per year. Full email content, in other words, is the one category that would strain even a purpose-built facility. The metadata relating to those emails would be roughly a thousand times smaller and far more doable.
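The same back-of-envelope math, in code; the 200-byte metadata record is my own rough assumption for sender, recipient, timestamps and routing headers:

```python
# Email volume at the estimates quoted in the text.
EMAILS_PER_DAY = 144.8e9
AVG_EMAIL_KB = 200            # deliberately generous; 75 KB is the common estimate
PB = 1000**5                  # one petabyte, decimal

bytes_per_day = EMAILS_PER_DAY * AVG_EMAIL_KB * 1000
per_year_eb = bytes_per_day * 365 / 1000**6

# Metadata only: assume a ~200-byte record per message (my assumption).
meta_pb_per_year = EMAILS_PER_DAY * 200 * 365 / PB

print(f"full content: {bytes_per_day / PB:.1f} PB/day, {per_year_eb:.1f} EB/year")
print(f"metadata:     {meta_pb_per_year:.1f} PB/year")
```

Full content comes out near 29 PB/day (about 10.6 EB/year), while the metadata alone is closer to 10 PB/year, which is why a metadata-first program is the easy and obvious starting point.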
The sheer magnitude and audacity of capturing and storing the full contents of our emails and picture mails are daunting, and I surely hope it ends men sending dick pics from their smartphones; but to be sure, the data storage end of it isn't the issue. The question should be what assurances we have that the NSA stopped with metadata.
You might be thinking about all those pictures and videos we send. These are the biggest files, and higher resolutions make for huge files, but most of us send lower-resolution copies over the net and phone and keep the originals on our home computers. I haven't found averages for these files, but if the pattern of phone calls, tweets and text messages holds, data storage isn't the issue. The question we should be asking is which photo hosting sites have been served with secret letters, and what information those letters require them to turn over.
It's the vastness of the data scoop that is impossible to comprehend, which is why the mainstream media denies its possibility. At realistic file sizes, full content capture of phone calls, tweets, text messages and email runs to petabytes per day, with email the biggest piece; the phone metadata the leaked court order actually describes, though, adds up to well under 100 terabytes per year, a couple dozen of these Hitachi drives. And the fact of the internet is that everything I describe here is a minuscule portion of what courses through it every day.
The Size and Distribution of the Network Matters - It Matters A Lot
I was talking to a group of IT managers, programmers and technicians about the NSA snooping. They, too, scoffed at the idea at first. Then someone mused aloud about how they would handle their end of it if they worked for the NSA. There was a long pause; then the table exploded with ideas. Distributed computer networking makes collecting this information quite possible, even probable, and the recently released documentation states it is in fact happening.
Disclaimer: my hardware experience is a little dated, so I had some trouble following parts of this conversation. I'm going to summarize the points I was able to take away from the table, with links to various computer terms where I had to do my homework. The main takeaway is that all the information related to an internet "click" can, without a doubt, be recorded.
Scooping the data as it flies across the internet, fiber-optic trunks, VOIP (Voice over Internet Protocol) services and phone towers could be accomplished using any of a dozen solutions. (There was some argument about whether all landline communications are digitized, with the table coming down 75/25 that they are.) The bottom line was that all digitized communication is reduced to a series of data packets, and once data packets exist, they can be copied. So the argument that the NSA "can't" record all phone, internet, email, text, tweet and Skype content is bogus. It can be done. We have the technology to do it, and we've likely had it for over a decade.
A distributed computer network would place "collector nodes" near any data switching station desired. These nodes would be configured to copy every data packet flowing through that intercept point and forward the copies to one or more data repositories. It would be easiest to do with the cooperation of every telecom, ISP and social media company in the U.S., but that's not so hard to imagine anymore, is it?
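The tee-and-forward idea behind a collector node is simple enough to sketch. Everything below (class and variable names, queues standing in for upload links) is invented for illustration, not a description of any real system:

```python
# A "collector node" in miniature: every packet passing through the intercept
# point is delivered onward untouched, while a copy is queued for each
# repository. Queues stand in for network links to the data centers.
from queue import Queue

class CollectorNode:
    def __init__(self, repositories: list[Queue]):
        self.repositories = repositories

    def forward(self, packet: bytes) -> bytes:
        for repo in self.repositories:   # tee a copy to every repository
            repo.put(packet)
        return packet                    # pass the original through unchanged

repo_a, repo_b = Queue(), Queue()
node = CollectorNode([repo_a, repo_b])
out = node.forward(b"hello")
```

The key property is that the sender and receiver see nothing different; the copy happens off to the side, which is exactly why packet-level interception is so hard to detect.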
What I've had the biggest problem wrapping my head around is why anyone would want all that data, and what they would do with it. And how much bandwidth is needed to accept 160 billion data items a day through some sort of VPN or SSL channel (Virtual Private Network or Secure Sockets Layer) into an automated system that "bags and tags" each data packet into databases and then places them into semi-permanent storage? Your home computer system couldn't do that, but Google can. I bet Yahoo and Facebook can too. I'll go all in and bet they could handle more than five times that traffic.
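The bandwidth question has a rough answer. Assuming an email-sized 75 KB average item (my assumption; most of the 160 billion items are far smaller), the sustained rate works out to about a terabit per second nationwide:

```python
# Sustained bandwidth implied by 160 billion items per day at an assumed
# 75 KB average item size (generous; tweets and texts are a few hundred bytes).
ITEMS_PER_DAY = 160e9
AVG_ITEM_BYTES = 75_000
SECONDS_PER_DAY = 86_400

bits_per_sec = ITEMS_PER_DAY * AVG_ITEM_BYTES * 8 / SECONDS_PER_DAY
items_per_sec = ITEMS_PER_DAY / SECONDS_PER_DAY

print(f"{bits_per_sec / 1e12:.2f} Tbps sustained, {items_per_sec:,.0f} items/second")
```

A terabit per second in aggregate, split across hundreds of intercept points, is within the capacity of modern fiber backbone links, which supports the table's conclusion that the big web companies could absorb it.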
The Hardware and Software
The solution would be to accept the data through multiple VPN or SSL endpoints spread throughout the country, though my mind boggles at the number of ports that would be necessary to accept 160 billion pieces of data every 24 hours, with surges from time to time. Each intercept point would need some temporary file storage, but as stated before, disk space really isn't a problem anymore; a handful of multi-terabyte drives would offer enough swap space to buffer the traffic at most intercept points.
As that raw data continuously flows in, it has to be shadow-copied regularly and backed up daily to maintain data integrity. The database needs to be separated into discrete components: SQL is scalable, but even Oracle has limits on how much data can be stuffed into one bag. The issue would be how to arrange the total data scoop into an array of interrelated data stores that lets a user seamlessly skim files across a distributed computing environment, gathering specific information by keyword, data tag or some other identifier. We know software fitting that description exists; it might have a Hadoop core, but today we know it as PRISM. The system would need inherent redundancy in both server power and data storage. (The Utah site is only the one storage site we know about; it's highly likely there are others we don't.) Making it all come together in harmony would require some well-orchestrated hardware, all of which is available today to anyone with deep enough pockets to buy it.
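One common way to spread a single logical dataset across many databases, offered here only as a sketch of the kind of partitioning such a system might use (not a description of PRISM itself), is to route each record by a stable hash of its identifier:

```python
# Hash-based sharding: each record lands on the same shard every time,
# regardless of insert order, so a lookup only ever touches one database.
# All identifiers below are made up for illustration.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key (phone number, email address, tag) to a shard index."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 8
shards = [[] for _ in range(NUM_SHARDS)]
for record in ["+1-555-0100", "alice@example.com", "+1-555-0100"]:
    shards[shard_for(record, NUM_SHARDS)].append(record)

# Retrieval goes straight to the one shard that can hold the key:
hits = shards[shard_for("+1-555-0100", NUM_SHARDS)]
```

Because the hash is deterministic, no central index is needed to find a record; every node can compute the route locally, which is what makes this pattern scale across a distributed environment.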
A data project this size would be built on the same data structures used by, say, an online gaming website or a political blog; the difference is scale, and scale comes down to networking. There would be an array of servers, some underutilized and some overutilized, plus separate servers whose job is to even out the I/O loads across all the others: virtual traffic cops directing the flow of data packets, if you need an analogy.
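The "traffic cop" can be sketched as a least-loaded dispatcher; the server names and class are invented for illustration:

```python
# A minimal load balancer: hand each incoming request to whichever backend
# currently has the fewest open connections, and decrement when one finishes.
class Balancer:
    def __init__(self, backends: list[str]):
        self.load = {name: 0 for name in backends}  # open connections per backend

    def dispatch(self) -> str:
        choice = min(self.load, key=self.load.get)  # least-loaded backend wins
        self.load[choice] += 1
        return choice

    def finished(self, name: str) -> None:
        self.load[name] -= 1

lb = Balancer(["srv-a", "srv-b", "srv-c"])
first_three = [lb.dispatch() for _ in range(3)]  # spreads one request to each
```

Real traffic cops also weigh server capacity and health checks, but the core idea is just this: keep a running tally and always pick the emptiest lane.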
As I understand it, a single physical machine, properly configured, can host a combination of up to 48 physical and virtual servers in one box. As the data load increases, you add more servers and link them together, and the linking can be virtual, so the machines don't have to be physically connected to each other. Each server can have 24-48 ports to service, and each port can service enough requests that each server handles hundreds of thousands of data requests every 24 hours. The data itself would reside in separate data storage servers that can sit miles away from both the data servers and the other storage devices, with each of the 48 servers in a box assigned its own load of data drives.
These server caches could be sprinkled across the nation in such a way that if one city lost power for an extended period, servers in other cities would pick up the load. (One box can hold 192 gigabytes of RAM, and we have solid-state drives with no spinning platters. The speed of current hardware is breathtaking to anyone who has worked with it for over 20 years.)
The bubble gum keeping it all together is a series of data switches whose primary function is to keep data packet requests and replies flying "through the wires" without collisions, with the assistance of those traffic-cop servers. Granted, 160 billion data items is a lot of information, but measured against everything flowing through our digital networks, 160 billion is tiny. Chances are that when the fallout finishes, we'll find the data scoop could easily be five times 160 billion items per day, or more.
12 Years Later
Chances are you remember where you were on 9/11/2001. Do you remember how the internet stopped working? Remember the momentary concern about the possible loss of financial data, and the news reports that later reassured the markets that no data was lost in the destruction because it was redundantly ensconced in New Jersey? Since 9/11, private telecoms and governments have had over ten years to adapt the internet to prevent that type of crash from happening again and to further improve redundant distributed computing and storage. We've had more than ten years to better distribute the traffic load; "the cloud" existed then, and it's much better now. Ten years is plenty of time to install the surveillance apparatus needed to scoop up whatever data the NSA wants to store, but chances are equally good that ABC, NBC, CBS, FOX, CNN and most newspaper and magazine editors will never feature an expert who explains this aspect of the story in terms most people can absorb. They could tour their own IT departments to illustrate how the technology works, but they don't; that would require some imagination. The full magnitude of how intrusive this surveillance system is boggles their minds. It's so boggling that they deny its reality. It's easier to focus on Where's-Waldo games with Edward Snowden, on whether he's a traitor or a hero, and, when that doesn't work, to attack the credibility of the reporters who did their jobs in telling the story.
The issue isn't whether the data collection can be done, because it can be and is being done; how much data is captured, and to what extent it is stored, is what's debatable. Edward Snowden is peripheral to the story, and I thank him for tipping me off to the extent of the surveillance the NSA conducts throughout the world. As to Snowden's criminality, I totally understand why he doesn't want to be confined to a windowless cell 23 hours a day, and I don't care if the U.S. ever catches him. My focus is on why Director of National Intelligence James Clapper lied, or prevaricated if you prefer, to Congress. Why do Dianne Feinstein et al. continue to lie, or prevaricate if you prefer, in claiming that this extent of surveillance is necessary? Why is Obama doubling down on it? How is this data used? How are people targeted? Apparently, it didn't flag the Tsarnaev brothers in time to stop them.
Gathering and Storing as Much Data as the NSA Wants is Doable
How it's done isn't as important to me as why it's done and how this information is used. What's with all this secrecy? We have secret laws, secret interpretations of laws, a secret court, secret warrants, secret subpoenas, secret gag orders, and no advocacy for the people subjected to secret denials of their Fourth Amendment rights (they don't know about the intrusions on their privacy, so they can't argue against them). We have court proceedings so secret that people accused of crimes cannot be present while the evidence against them is discussed in court. This secrecy is aided and abetted by a judicial rubber-stamp system with no advocate for the defendant, and we have next to no public reporting on anything FISA-related. Meanwhile, both the Executive branch and Congress just smile and nod while our Fourth Amendment rights are shredded. Then, to appease our concerns, these same government players tell me that all this secrecy is supposed to keep me safe and that it's better for me not to know about it. That's what I find questionable.