{"id":7270,"date":"2016-10-17T14:08:24","date_gmt":"2016-10-17T12:08:24","guid":{"rendered":"https:\/\/blog.redbaronofazure.com\/?p=7270"},"modified":"2016-10-17T14:32:24","modified_gmt":"2016-10-17T12:32:24","slug":"put-block-apis-and-c","status":"publish","type":"post","link":"https:\/\/blog.redbaronofazure.com\/?p=7270","title":{"rendered":"PutBlock APIs and C++"},"content":{"rendered":"<p>Ingesting data into storage may be an unexciting task, but it is one that you come back to again and again. There are many tools for uploading files to Azure Storage, and\u00a0there are also many SDKs, so you can implement it in most popular languages. I&#8217;ve written before about how you can use the PutBlock APIs in both <a href=\"https:\/\/blog.redbaronofazure.com\/?p=1\">Java<\/a> and <a href=\"https:\/\/blog.redbaronofazure.com\/?p=6791\">JavaScript<\/a>, and recently I had the chance to implement it in C++ to create a portable solution.<\/p>\n<p><strong>PutBlock and PutBlockList APIs<\/strong><\/p>\n<p>The Azure Storage APIs PutBlock and PutBlockList are very basic but also very powerful. With PutBlock you upload one chunk of a file, and you repeat the call until the whole file is uploaded. You then commit all of the chunks in one call to PutBlockList, which takes all the chunk ids in their intended order. The beauty is that you can upload a\u00a0large file in parallel by using multiple threads, thereby avoiding the\u00a0sequential processing of first reading a chunk and sending it over the network, then reading some more, and so on. How many threads to use and how large the chunk size should be may vary from machine to machine, but the goal is to maximize the utilization of the network card.<\/p>\n<p><strong>Azure Storage C++ library<\/strong><\/p>\n<p>Microsoft does not have a complete C++ Azure SDK as it has for C#, Java and other languages, but there is a C++ library for Azure Storage called <a href=\"https:\/\/github.com\/azure\/azure-storage-cpp\">azure-storage-cpp<\/a>. 
In a <a href=\"https:\/\/blog.redbaronofazure.com\/?p=7232\">previous post<\/a> I wrote about how to leverage that library together with another Microsoft C++ REST library named <a href=\"https:\/\/github.com\/microsoft\/cpprestsdk\">Casablanca<\/a> to implement Azure Key Vault functionality in C++. The sample program I develop in this post builds on the same principles, so the instructions for how to build azure-storage-cpp and Casablanca will not be repeated here.<\/p>\n<p><strong>Parallel upload sample<\/strong><\/p>\n<p>Each chunk of the file holds its starting offset in the file, the length of the chunk and its identity.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true\">class FileChunk\r\n{\r\npublic:\r\n\tint id;                                             \/\/ seq id of the chunk to read\r\n\tunsigned long startpos;                             \/\/ offset in file where to start reading\r\n\tunsigned long length;                               \/\/ length of chunk to read from file\r\n\tint threadid;                                       \/\/ marked by the thread that pulls the piece from the queue\r\n\tbool completed;                                     \/\/ marked by the thread when completed\r\n\tunsigned long bytesread;                            \/\/ actual bytes read from file\r\n\tfloat seconds;                                      \/\/ time it took to send this chunk to Azure Storage\r\n\tutility::string_t block_id;                         \/\/ BlockId for Azure\r\n<\/pre>\n<p>The list of chunks is stored twice. 
First in a list so we can keep track of them, and then also in a queue that serves as a way to feed all background threads with work to do.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true \">std::list&lt;FileChunk*&gt; chunkl;\r\nstd::queue&lt;FileChunk*&gt; queueChunks; \r\n<\/pre>\n<p>Before creating the threads, we create a list of FileChunk objects and populate the list and queue.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true\">\/\/ create chunks and push them on a queue\r\nwhile (remaining &gt; 0)\r\n{\r\n    chunksread++;\r\n    long toread = remaining &gt; this-&gt;chunkSize ? this-&gt;chunkSize : remaining;\r\n    FileChunk *fc = new FileChunk(chunksread, (unsigned long)currpos, (unsigned long)toread);\r\n    chunkl.push_back(fc);\r\n    this-&gt;queueChunks.push(fc);\r\n    remaining -= toread;\r\n    currpos += toread;\r\n}<\/pre>\n<p>Then it&#8217;s time to create the threads and put them to work.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true\">\/\/ create threads that process tasks in the queue\r\nstd::list&lt;std::thread*&gt; vt;\r\nfor (int n = 1; n &lt;= countThreads; n++)\r\n{\r\n    std::thread *t1 = new std::thread( threadproc, n, this );\r\n    vt.push_back(t1);\r\n}<\/pre>\n<p>The background processing is pretty simple. It just pulls the next chunk to be processed off the queue, reads the file as specified in the FileChunk info, creates a BlockId and sends the chunk to\u00a0Azure Storage via the PutBlock API, which in the C++ library is called upload_block. 
Since it is a common queue, we can use as many threads as we want, but there is really no point in having too many since the bottleneck will be the throughput of the network card.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true\">azure::storage::cloud_block_blob blob = this-&gt;container.get_block_blob_reference( this-&gt;blobName );\r\nstd::vector&lt;uint8_t&gt; buffer;\r\n\/\/ get the next file I\/O task from the queue and read that chunk\r\nwhile (true)\r\n{\r\n    FileChunk *fc = nullptr;\r\n    {\r\n        \/\/ the queue is shared by all threads, so guard empty\/front\/pop with a lock\r\n        \/\/ (queueMutex is an assumed std::mutex member, not shown in the original listing)\r\n        std::lock_guard&lt;std::mutex&gt; lock(this-&gt;queueMutex);\r\n        if (this-&gt;queueChunks.empty())\r\n            break;\r\n        fc = this-&gt;queueChunks.front();\r\n        this-&gt;queueChunks.pop();\r\n    }\r\n    \/\/ read the specified chunk from the file\r\n    buffer.resize(fc-&gt;length);\r\n    file.seekg(fc-&gt;startpos, ios::beg);\r\n    file.read((char*)buffer.data(), fc-&gt;length);\r\n    fc-&gt;bytesread = (unsigned long)file.gcount();\r\n    \/\/ upload only the bytes actually read (the last chunk is usually shorter)\r\n    buffer.resize(fc-&gt;bytesread);\r\n    \/\/ create Azure Block ID value\r\n    fc-&gt;block_id = utility::conversions::to_base64(fc-&gt;id);\r\n    auto stream = concurrency::streams::bytestream::open_istream(buffer);\r\n    utility::string_t md5 = _XPLATSTR(\"\");\r\n    unsigned long t0 = clock();\r\n    blob.upload_block(fc-&gt;block_id, stream, md5);\r\n    fc-&gt;seconds = (float)(clock() - t0) \/ (float)CLOCKS_PER_SEC;\r\n    fc-&gt;threadid = threadid;\r\n    fc-&gt;completed = true;\r\n}\r\n<\/pre>\n<p>When the queue is empty, the threads terminate and the main program continues with the next step, which is building a list of the BlockIds and committing them with a final call to the PutBlockList API. 
This method is named upload_block_list in the C++ library.<\/p>\n<pre class=\"theme:vs2012-black lang:c++ decode:true \">\/\/ wait for all threads to complete\r\nstd::list&lt;std::thread*&gt;::iterator itt;\r\nfor (itt = vt.begin(); itt != vt.end(); ++itt)\r\n{\r\n    (*itt)-&gt;join();\r\n    delete (*itt);      \/\/ the thread objects were allocated with new\r\n}\r\n\/\/ create the block list vector from results\r\nthis-&gt;total_bytes = 0;\r\nstd::vector&lt;azure::storage::block_list_item&gt; vbi;\r\nstd::list&lt;FileChunk*&gt;::iterator it;\r\nfor (it = chunkl.begin(); it != chunkl.end(); ++it)\r\n{\r\n    \/\/ construct the item in place; no heap allocation is needed\r\n    vbi.push_back(azure::storage::block_list_item((*it)-&gt;block_id));\r\n    this-&gt;total_bytes += (*it)-&gt;bytesread;\r\n    delete (*it);\r\n}\r\n\/\/ commit the block list items to Azure Storage\r\nblob1.upload_block_list(vbi);\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Running the program<\/strong><\/p>\n<p>To specify the storage account to be used, you can either pass the account name and key on the command line or set them as the environment variables STORAGE_ACCOUNT_NAME and STORAGE_ACCESS_KEY. Then you pass the local filename and the name of the storage container on the command line to execute the program. Below is the output from running it on a D1 VM with 4 threads uploading a 1 GB file. It&#8217;s a pretty impressive performance. 
To get even better perf, you would need to create a VM in Azure with multiple NICs.<\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7268\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run.png\" alt=\"cppblockupload_run\" width=\"1426\" height=\"306\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run.png 1426w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run-300x64.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run-768x165.png 768w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_run-1024x220.png 1024w\" sizes=\"(max-width: 1426px) 100vw, 1426px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax.png\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-7269\" src=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax.png\" alt=\"cppblockupload_syntax\" width=\"1426\" height=\"752\" srcset=\"https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax.png 1426w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax-300x158.png 300w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax-768x405.png 768w, https:\/\/blog.redbaronofazure.com\/wp-content\/uploads\/2016\/10\/cppblockupload_syntax-1024x540.png 1024w\" sizes=\"(max-width: 1426px) 100vw, 1426px\" \/><\/a><\/p>\n<p><strong>Building the program<\/strong><\/p>\n<p>The C++ program can be built on Windows, Linux and Mac. The github repo contains a Visual Studio 2015 solution and a makefile that you can use on Linux\/Mac. 
On Windows, azure-storage-cpp is installed automatically via NuGet (or you can run Install-Package wastorage yourself). On Linux\/Mac, you have to git clone and build Casablanca and azure-storage-cpp manually.<\/p>\n<p><strong>Summary<\/strong><\/p>\n<p>The PutBlock\/PutBlockList APIs are quite versatile in that you can build data ingestion solutions in almost all popular languages. With the C++ library solution, you can build your own tools that can run on a wide array of machines and devices. C++ might not be the language that you write a lot of code in today, but if you need raw performance and portability, you can interact with Azure Storage quite easily and achieve good performance.<\/p>\n<p><strong>References<\/strong><\/p>\n<p>GitHub repo &#8211; Source code<br \/>\n<a href=\"https:\/\/github.com\/cljung\/azblockupload\">https:\/\/github.com\/cljung\/azblockupload<\/a><\/p>\n<p>MSDN documentation &#8211; PutBlock APIs<br \/>\n<a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dd135726.aspx\">https:\/\/msdn.microsoft.com\/en-us\/library\/dd135726.aspx<\/a><\/p>\n<p>GitHub \u2013 azure-storage-cpp<br \/>\n<a href=\"https:\/\/github.com\/Azure\/azure-storage-cpp\">https:\/\/github.com\/Azure\/azure-storage-cpp<\/a><\/p>\n<p>GitHub \u2013 Casablanca C++ REST SDK<br \/>\n<a href=\"https:\/\/github.com\/microsoft\/cpprestsdk\">https:\/\/github.com\/microsoft\/cpprestsdk<br \/>\n<\/a>Make sure to look at the wiki page for documentation on how to build it.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ingesting data into storage may be an unexciting task, but it is one that you come back to again and again. There are many tools for uploading files to Azure Storage, and\u00a0there are also many SDKs, so you can implement it in most popular languages. 
I&#8217;ve written how you can use the PutBlock [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[405,131,121,321],"tags":[407,31],"_links":{"self":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7270"}],"collection":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7270"}],"version-history":[{"count":4,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7270\/revisions"}],"predecessor-version":[{"id":7274,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=\/wp\/v2\/posts\/7270\/revisions\/7274"}],"wp:attachment":[{"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.redbaronofazure.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}