Reconstructing files uploaded in SQL Server with Python












I am working with a SQL Server database table similar to this:



USER_ID varchar(50), FILE_NAME ntext, FILE_CONTENT ntext


sample data:



USER_ID:      1
FILE_NAME: (AttachedFiles:1)=file1.pdf
FILE_CONTENT: (AttachedFiles:1)=H4sIAAAAAAAAAOy8VXQcy7Ku….


Using regular expressions, I have successfully isolated the "content" of the FILE_CONTENT field by removing the "(AttachedFiles:1)=" part, which leaves a string similar to this:



content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc…"
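A minimal sketch of that prefix-stripping step might look like the following. It assumes every value starts with a marker of the form "(AttachedFiles:N)="; the pattern and the helper name strip_prefix are illustrative, not the exact code used here.

```python
import re

# Hypothetical pattern: assumes each value starts with "(AttachedFiles:<number>)=".
PREFIX_RE = re.compile(r"^\(AttachedFiles:\d+\)=")


def strip_prefix(value: str) -> str:
    """Return the value without its leading "(AttachedFiles:N)=" marker."""
    return PREFIX_RE.sub("", value.strip(), count=1)


file_name = strip_prefix("(AttachedFiles:1)=file1.pdf")
print(file_name)
```

Values that do not carry the marker pass through unchanged, so the same helper can be applied to both FILE_NAME and FILE_CONTENT.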


My plan is to reconstruct the file from this string so that I can download it from the database. While researching, I found this post and adapted its code like this:



import base64
import os

content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(content_str))


...getting a TypeError: expected bytes-like object, not str
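For reference, this TypeError comes from passing a str to a function that only accepts bytes. The sketch below uses a short made-up sample string (not the real file content) and base64.decodebytes(), the function that the older name decodestring() was a deprecated alias for. It is only an illustration of the str versus bytes behaviour, not the code from either post.

```python
import base64

content_str = "SGVsbG8="  # hypothetical sample: base64 for b"Hello"

# decodebytes() only accepts bytes-like input, so passing a str raises TypeError.
try:
    base64.decodebytes(content_str)
    raised = False
except TypeError:
    raised = True

# Encoding the text to ASCII bytes first satisfies decodebytes() ...
from_bytes = base64.decodebytes(content_str.encode("ascii"))

# ... while b64decode() accepts an ASCII str directly.
from_str = base64.b64decode(content_str)

print(raised, from_bytes, from_str)
```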



Investigating further, I found this other post and proceeded like this:



import base64
import os

content_str = 'H4sIAAAAAAAAAO19B0AUR/v33...'
encoded = content_str.encode('ascii')
with open(os.path.expanduser('test.pdf'), 'wb') as f:
    f.write(base64.decodestring(encoded))


...which successfully creates a PDF file. However, when I try to open it, I get an error saying that the file is corrupt.



I would appreciate any suggestions on how to proceed. I am even open to rethinking the process I've come up with if necessary. Many thanks in advance!










      python sql-server






      edited Nov 19 '18 at 19:56 by Tomalak
      asked Nov 19 '18 at 16:26 by DanielaC
























          1 Answer
          The value of FILE_CONTENT is base64-encoded: it is a string built from an alphabet of 64 characters that represents raw bytes. All you need to do is base64-decode the string and write the resulting bytes directly to a file.



          import base64
          import os

          content_str = "H4sIAAAAAAAAAOy8VXQcy7Ku22JmZmZmspiZGS2WLGa0xc=="

          # b64decode() accepts an ASCII str and returns the raw bytes
          with open(os.path.expanduser('test.pdf'), 'wb') as fp:
              fp.write(base64.b64decode(content_str))


          The base64 sequence "H4sI" at the start of your content string translates to the bytes 0x1f, 0x8b, 0x08. These bytes are not normally at the start of a PDF file, but indicate a gzip-compressed data stream. It's possible that a PDF reader won't understand this.
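As a rough way to check this before writing anything to disk, the first few base64 characters can be decoded and compared against known file signatures. The helper name sniff_format and the two constants are illustrative assumptions; the signatures themselves are the standard gzip magic bytes and the "%PDF-" header that PDF files begin with.

```python
import base64

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream
PDF_MAGIC = b"%PDF-"      # header at the start of a PDF file


def sniff_format(b64_text: str) -> str:
    """Guess the payload type from the first decoded bytes (hypothetical helper)."""
    # 8 base64 characters decode to 6 bytes, enough for both signatures.
    head = base64.b64decode(b64_text[:8])
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


print(base64.b64decode("H4sI"))
print(sniff_format("H4sIAAAAAAAAAOy8"))
```

The sketch assumes the input holds at least 8 valid base64 characters; shorter or malformed input would need extra handling.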



          I don't know for certain if gzip compression is a valid part of the PDF file format, but it's a valid part of web communication, so maybe the file stream was compressed for transfer/download and has not been decompressed before writing it to the database.



          If your PDF reader does not accept the data as is, decompress it before saving it to file:



          import gzip

          # ... (imports and content_str as above)

          with open(os.path.expanduser('test.pdf'), 'wb') as fp:
              fp.write(gzip.decompress(base64.b64decode(content_str)))





          • Thanks Tomalak! I tried your suggestion, but I am now getting "EOFError: Compressed file ended before the end-of-stream marker was reached". Looking into it further, I found some threads suggesting that the error is due to file corruption. Any further suggestions would be much appreciated.

            – DanielaC
            Nov 20 '18 at 15:23











          • First, try to write the stream to a file without passing it through gzip.decompress(). Then try to open the resulting file with your PDF reader, just to check the off-chance that it knows what to do. If it complains, try opening the resulting file in 7zip (which can deal with all kinds of compression formats) to find out if there is anything in it at all. Maybe gzip.decompress() is not the right tool; it was only an educated guess of mine.

            – Tomalak
            Nov 20 '18 at 15:29
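One way to find out from Python whether there is anything in the stream at all is to decompress as much as possible and check whether the end-of-stream marker was reached. This is a diagnostic sketch only: the helper name inflate_partial is made up, and it assumes the decoded bytes really are a gzip stream, possibly cut short.

```python
import zlib


def inflate_partial(raw: bytes):
    """Decompress as much of a gzip stream as possible (hypothetical helper).

    Returns (data, complete); complete is False when the input ended
    before the gzip end-of-stream marker.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(raw)
    return data, decompressor.eof
```

Calling inflate_partial(base64.b64decode(content_str)) and getting complete == False would suggest the stored value is truncated rather than in the wrong format; the recovered data can then be inspected to see whether it begins like a PDF.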











          • I created a PDF without gzip.decompress() and could not open it in the reader. I then changed the file extension to .zip, .rar and .7z, and 7zip failed to extract any of them. However, when decompressing it as .gzip, the error I get is "Unexpected end of data". Thanks again!

            – DanielaC
            Nov 20 '18 at 15:58











          • Can you upload the file you currently have somewhere? I can try and take a look at it, maybe I can figure something out. No promises though.

            – Tomalak
            Nov 20 '18 at 16:02











          • Thank you so much Tomalak! On my github now: github.com/dcct84/encodedfiles_test

            – DanielaC
            Nov 20 '18 at 16:52











          answered Nov 19 '18 at 20:16 by Tomalak (edited Nov 19 '18 at 20:28)












