Delete duplicate entries in extremely big Database











Last week I accidentally introduced a bug on our website: it kept reporting an error when receiving data to insert into the database, even though the new entry was actually being added correctly. Because of the reported error, the same information kept being resent and rewritten, duplicating every entry anywhere from a few dozen to a few hundred times. I don't know why the loop stopped, but eventually it returned an OK and stopped writing.



At first I didn't realize the error came from my own code, so it took me a couple of days to fix it; in the meantime our database kept being bombarded and flooded with duplicates.



In hindsight I didn't handle this very well, I know, but now I would like to fix what I broke.



I have been searching Stack Overflow for ways to clean up duplicate rows, and there are dozens of questions and answers about it, so I Frankensteined together a solution that seems to work:



DELETE FROM app_info
WHERE `idinfo` NOT IN (SELECT minid
                       FROM (SELECT Min(`idinfo`) AS minid
                             FROM app_info
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e)
  AND `idinfo` < 1545000
  AND `idinfo` > 1541500;


This works, but I have to run it over brackets of IDs and limit the search to that range, otherwise it takes too long and times out before doing anything. Even so, each bracket takes 3 or 4 minutes during which the website stops working (the database is too busy, I guess), and at this rate it will take me ages to clean the whole table.



I have been thinking of a way to streamline this. My idea was to select all the distinct entries starting from the day I introduced the bug, group them into a list, and then maybe write a PHP script that loops through the list, finds all the rows that correspond to each item, and deletes all but one.



I have tried this; making the list is easy, I think:



SELECT * FROM `App_info`
WHERE `when` > '2018-11-05'
GROUP BY `typeInfo`, `str2`
ORDER BY `App_info`.`when` ASC


But I have not been able to reformulate my first query to use the information the second one produces. At first I assumed that replacing the WHERE conditions I used to make the brackets with conditions on the fields I grouped by in the second query would help, but when I do that I get 0 rows, so it does nothing.



Also, depending on the entry selected, if it has too many copies (like the ones duplicated hundreds of times) it crashes the database anyway... so this doesn't seem to be the solution I was looking for.



I don't know what else to try. How can I clean up the duplicate entries when there are so many of them, without crashing the database? I guess that hogging the DB will be inevitable, but I can schedule downtime for maintenance, so that would not be a problem.










mysql






edited Nov 13 at 14:55 – Madhur Bhaiya

asked Nov 13 at 14:50 – Helwar












  • Thanks for formatting it
    – Helwar
    Nov 13 at 14:56










  • I think you are missing Having COUNT(*) > 1 in the subquery
    – Madhur Bhaiya
    Nov 13 at 15:01


















2 Answers

I suggest using a temporary table to store the duplicate IDs.




  1. Use the SELECT query that gives you the duplicate IDs to be deleted, but insert those IDs into a new table instead of deleting them. With the GROUP BY this may take a while, but it will not lock the database.


  2. Run whatever tests you need to verify that the temp table contains only the IDs that need to be deleted.


  3. In a maintenance window, after a backup, run a DELETE ... WHERE ID IN (SELECT ID FROM temp_table).


  4. If this is still too slow, delete in batches using the IDs from the temp table.



The major advantage is that you're not running that big heavy query on your table while at the same time locking the table for deletes, which may even cause a deadlock.
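

For illustration, here is a minimal sketch of what these steps could look like on the app_info table from the question. The temp table name, the 2018-11-05 date filter and the 10000 batch size are assumptions added here, not part of the original answer:

-- Step 1: collect the IDs of the extra copies (everything except the first row of each group).
-- tmp_duplicate_ids, the date filter and the batch size are illustrative assumptions.
CREATE TABLE tmp_duplicate_ids (idinfo INT PRIMARY KEY);

INSERT INTO tmp_duplicate_ids (idinfo)
SELECT `idinfo`
FROM app_info
WHERE `when` > '2018-11-05'
  AND `idinfo` NOT IN (SELECT minid
                       FROM (SELECT Min(`idinfo`) AS minid
                             FROM app_info
                             WHERE `when` > '2018-11-05'
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e);

-- Step 2: inspect tmp_duplicate_ids as much as you like; app_info is untouched so far.

-- Steps 3 and 4: in the maintenance window, delete in small batches and rerun the
-- statement until it reports 0 affected rows.
DELETE FROM app_info
WHERE `idinfo` IN (SELECT `idinfo` FROM tmp_duplicate_ids)
LIMIT 10000;

Repeating the batched DELETE keeps each transaction short, so the website stays responsive between batches.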






answered Nov 13 at 15:53 – Cyrus (accepted)





















  • Sorry for taking so long to accept an answer. I was quite busy last week. This is what helped me most, thank you! I could do it in an hour approx, had to break it in a few batches but not as many as how I was doing earlier. Thanks again.
    – Helwar
    Nov 19 at 7:08































An efficient option is to use an INSERT IGNORE query. The steps (a rough sketch follows the list):




  • Create a new table with the same schema as the existing table.


  • Add a UNIQUE constraint on the columns that define a duplicate.


  • Run INSERT IGNORE to copy data from the original table into the new table. Any duplicate rows violate the UNIQUE constraint, so they are skipped instead of inserted.


  • Rename the original table to something else and rename the new table to the
    original name.


  • Drop the redundant table.
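

A rough sketch of these steps, assuming MySQL and the app_info table from the question; the new table's name and the columns chosen for the UNIQUE key are illustrative guesses, not part of the original answer:

-- Sketch only: app_info_dedup and the UNIQUE key columns are assumptions.
CREATE TABLE app_info_dedup LIKE app_info;

-- Pick the columns that should uniquely identify a row.
ALTER TABLE app_info_dedup
  ADD UNIQUE KEY uq_dedup (`when`, `idbooth`, `iddongle`, `typeinfo`, `str2`);

-- Copy the data; rows that violate the UNIQUE key are silently skipped.
INSERT IGNORE INTO app_info_dedup SELECT * FROM app_info;

-- Swap the tables, then drop the old one once you have verified the result.
RENAME TABLE app_info TO app_info_old, app_info_dedup TO app_info;
DROP TABLE app_info_old;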


Hope this helps.






answered Nov 13 at 16:04 – TeeKea




















