Delete duplicate entries in extremely big Database











Last week I accidentally introduced a bug on our website: it kept reporting an error when receiving data to insert into the database, even though the new entry was actually being added correctly. Because of the reported error, the same information kept being resent and rewritten, duplicating every entry anywhere from a few dozen to a few hundred times. I don't know why the loop stopped, but eventually it returned an OK and stopped writing.



At first I didn't realize the error came from my own code, so it took me a couple of days to fix it; in the meantime our database kept being bombarded and flooded with duplicates.



In hindsight I didn't handle this very well, I know, but now I would like to fix what I broke.



I have been searching Stack Overflow for ways to clean up duplicate rows, and there are dozens of questions and answers about it, so I Frankensteined together a solution that seems to work:



DELETE FROM app_info
WHERE `idinfo` NOT IN (SELECT minid
                       FROM (SELECT Min(`idinfo`) AS minid
                             FROM app_info
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e)
  AND `idinfo` < 1545000
  AND `idinfo` > 1541500;


This works, but I have to run it over brackets of IDs and limit the search to that range, otherwise it takes too long and times out before doing anything. Even so, each bracket takes 3 or 4 minutes during which the website stops working (the database is too busy, I guess), and at this rate it will take me ages to clean the whole table.



I have been thinking of a way to streamline this. My idea was to select all the distinct entries starting from the day I introduced the bug, group them into a list, and then maybe write a PHP script that loops through the list, finds all the rows that correspond to each item, and deletes all but one.



I have tried this; making the list is easy, I think:



SELECT * FROM `App_info`
WHERE `when` > '2018-11-05'
GROUP BY `typeInfo`, `str2`
ORDER BY `App_info`.`when` ASC


But I have not been able to reformulate my first query to use the information the second one produces. At first I assumed that replacing the WHERE conditions I used to make the brackets with conditions on the fields I grouped by in the second query would help, but when I do that I get 0 rows, so it does nothing.



Also, depending on the entry selected, if it has too many copies (like the ones duplicated hundreds of times) it crashes the database anyway... so this doesn't seem to be the solution I was looking for.



I don't know what else to try. How can I clean up the duplicate entries when there are so many of them, without crashing the database? I guess that hogging the DB will be inevitable, but I can schedule downtime for maintenance, so that would not be a problem.










mysql






edited Nov 13 at 14:55 – Madhur Bhaiya

asked Nov 13 at 14:50 – Helwar












  • Thanks for formatting it
    – Helwar
    Nov 13 at 14:56










  • I think you are missing Having COUNT(*) > 1 in the subquery
    – Madhur Bhaiya
    Nov 13 at 15:01


















2 Answers

I suggest using a temporary table to store the duplicate IDs.




  1. Use the SELECT query that gives you the duplicate IDs to be deleted, but insert those IDs into a new table instead of deleting them. With the GROUP BY this may take a while, but it will not lock the database.


  2. Run whatever tests you need to verify that the temp table contains only the IDs that need to be deleted.


  3. In a maintenance window, after a backup, run a DELETE ... WHERE ID IN (SELECT ID FROM temp_table).


  4. If this is still too slow, delete in batches using the IDs from the temp table.



The major advantage is that you're not running that big heavy query on your table while at the same time locking the table for deletes, which may even cause a deadlock.
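

For illustration, here is a minimal sketch of what these steps could look like on the app_info table from the question. The temp table name, the 2018-11-05 date filter and the 10000 batch size are assumptions added here, not part of the original answer:

-- Step 1: collect the IDs of the extra copies (everything except the first row of each group).
-- tmp_duplicate_ids, the date filter and the batch size are illustrative assumptions.
CREATE TABLE tmp_duplicate_ids (idinfo INT PRIMARY KEY);

INSERT INTO tmp_duplicate_ids (idinfo)
SELECT `idinfo`
FROM app_info
WHERE `when` > '2018-11-05'
  AND `idinfo` NOT IN (SELECT minid
                       FROM (SELECT Min(`idinfo`) AS minid
                             FROM app_info
                             WHERE `when` > '2018-11-05'
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e);

-- Step 2: inspect tmp_duplicate_ids as much as you like; app_info is untouched so far.

-- Steps 3 and 4: in the maintenance window, delete in small batches and rerun the
-- statement until it reports 0 affected rows.
DELETE FROM app_info
WHERE `idinfo` IN (SELECT `idinfo` FROM tmp_duplicate_ids)
LIMIT 10000;

Repeating the batched DELETE keeps each transaction short, so the website stays responsive between batches.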






answered Nov 13 at 15:53 – Cyrus (accepted)





















  • Sorry for taking so long to accept an answer. I was quite busy last week. This is what helped me most, thank you! I could do it in an hour approx, had to break it in a few batches but not as many as how I was doing earlier. Thanks again.
    – Helwar
    Nov 19 at 7:08































An efficient option is to use an INSERT IGNORE query. The steps (a rough sketch follows the list):




  • Create a new table with the same schema as the existing table.


  • Add a UNIQUE constraint on the columns that define a duplicate.


  • Run INSERT IGNORE to copy data from the original table into the new table. Any duplicate rows violate the UNIQUE constraint, so they are skipped instead of inserted.


  • Rename the original table to something else and rename the new table to the
    original name.


  • Drop the redundant table.
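

A rough sketch of these steps, assuming MySQL and the app_info table from the question; the new table's name and the columns chosen for the UNIQUE key are illustrative guesses, not part of the original answer:

-- Sketch only: app_info_dedup and the UNIQUE key columns are assumptions.
CREATE TABLE app_info_dedup LIKE app_info;

-- Pick the columns that should uniquely identify a row.
ALTER TABLE app_info_dedup
  ADD UNIQUE KEY uq_dedup (`when`, `idbooth`, `iddongle`, `typeinfo`, `str2`);

-- Copy the data; rows that violate the UNIQUE key are silently skipped.
INSERT IGNORE INTO app_info_dedup SELECT * FROM app_info;

-- Swap the tables, then drop the old one once you have verified the result.
RENAME TABLE app_info TO app_info_old, app_info_dedup TO app_info;
DROP TABLE app_info_old;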


Hope this helps.






answered Nov 13 at 16:04 – TeeKea




















