Delete duplicate entries in extremely big Database
Last week I accidentally introduced a bug in our website: whenever data was received and inserted into the database, the code reported an error even though the new entry had actually been written. Because of that, the same information kept being resent, each time reporting a write error and writing to the database again, so every entry was duplicated anywhere from a few dozen to a few hundred times. I don't know why the loop eventually stopped, but at some point it returned an OK and stopped writing.
At first I didn't realize the error came from my own code, so it took me a couple of days to fix it; meanwhile the database kept being flooded.
In hindsight I didn't handle this very well, I know, but now I would like to fix what I broke.
I have searched Stack Overflow for ways to clean up duplicate rows, and there are dozens of questions and answers about it, so I frankensteined a solution that seems to work:
DELETE FROM app_info
WHERE `idinfo` NOT IN (SELECT minid
FROM (SELECT Min(`idinfo`) AS minid
FROM app_info
GROUP BY `when`,
`idbooth`,
`iddongle`,
`typeinfo`,
`money`,
`money2`,
`currency`,
`stock`,
`i1`,
`i2`,
`i3`,
`i4`,
`i5`,
`str1`,
`str2`,
`pbnew`,
`in1`,
`in2`,
`in3`,
`in4`,
`in5`,
`in6`,
`in7`,
`in8`) e)
AND `idinfo` < 1545000
AND `idinfo` > 1541500;
This works, but I have to restrict it to small brackets of ids and limit the search there; otherwise it takes too long and gets cancelled before doing anything. Even with the bracket, it takes 3 or 4 minutes during which the website stops working (the database is too busy, I guess), and at this pace it will take me ages to clean the whole table.
I have been thinking of a way to streamline this. My idea was to select all the distinct entries starting from the day I introduced the bug, group them into a list, and then maybe write a PHP script that loops through that list, finds all the rows matching each item, and deletes all but one.
Making the list seems easy enough, I think:
SELECT * FROM `App_info`
WHERE `when` > '2018-11-05'
GROUP BY `typeInfo` , `str2`
ORDER BY `App_info`.`when` ASC
But I have not been able to reformulate my first query to use the information that the second one generates. I assumed that replacing the WHERE conditions I used for the id brackets with conditions on the fields I grouped by in the second query would work, but when I do that the DELETE matches 0 rows, so it does nothing.
Also, depending on the entry selected, if it has too many copies (like the ones duplicated hundreds of times) it overwhelms the database anyway... so this doesn't seem to be the solution I was looking for.
I don't know what else to try. How can I clean up the duplicate entries when there are so many of them, without bringing the database down? I guess hogging the DB will be inevitable, but I can schedule a maintenance downtime, so that would not be a problem.
mysql

asked Nov 13 at 14:50 – Helwar
edited Nov 13 at 14:55 – Madhur Bhaiya
Thanks for formatting it – Helwar, Nov 13 at 14:56
I think you are missing HAVING COUNT(*) > 1 in the subquery – Madhur Bhaiya, Nov 13 at 15:01
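For illustration only, a sketch of what that clause looks like in the grouped query, used here just to list the groups that actually contain duplicates (the GROUP BY list below is shortened and is an assumption; the real query would group by every column from the original statement):

SELECT MIN(`idinfo`) AS minid,
       COUNT(*) AS copies
FROM app_info
GROUP BY `when`,
         `idbooth`,
         `iddongle`,
         `typeinfo`,
         `str2`
HAVING COUNT(*) > 1;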
2 Answers
Accepted answer (answered Nov 13 at 15:53 – Cyrus)
I suggest using a temporary table to store the IDs of the duplicates.
Use your SELECT query that finds the duplicate ids to be deleted, but insert those ids into a new table instead of deleting right away. With the GROUP BY this may take a while, but it will not lock the database.
Run whatever tests you need to verify that the temp table contains only the IDs that need to be deleted.
In a maintenance window, after a backup, run a DELETE WHERE id IN (SELECT id FROM temp_table).
If that still takes too long, you can delete in batches using the temp table.
The major advantage is that you are not running that big, heavy grouping query against the table at the same time you are locking it for deletes, which could otherwise even cause a deadlock.
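A minimal sketch of what this could look like, assuming idinfo is an integer primary key and reusing the table and column names from the question (the helper table name and the batch size are placeholders, adjust them to your schema):

-- 1) Collect the ids of the rows to delete into a helper table.
CREATE TABLE app_info_dupes (idinfo INT PRIMARY KEY);

INSERT INTO app_info_dupes (idinfo)
SELECT `idinfo`
FROM app_info
WHERE `idinfo` NOT IN (SELECT minid
                       FROM (SELECT MIN(`idinfo`) AS minid
                             FROM app_info
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e);

-- 2) Verify: check the count and spot-check a few ids before touching app_info.
SELECT COUNT(*) FROM app_info_dupes;

-- 3) In the maintenance window, delete in batches; rerun until 0 rows are affected.
DELETE FROM app_info
WHERE `idinfo` IN (SELECT idinfo FROM app_info_dupes)
LIMIT 10000;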
Sorry for taking so long to accept an answer; I was quite busy last week. This is what helped me the most, thank you! I managed to do it in about an hour. I had to break it into a few batches, but far fewer than with my earlier approach. Thanks again. – Helwar, Nov 19 at 7:08
Answer (answered Nov 13 at 16:04 – TeeKea)
You can also do this efficiently with an INSERT IGNORE query. The steps (a sketch follows after the list):
- Create a new table with the same schema as the existing one.
- Add a UNIQUE constraint on the columns we want to deduplicate on.
- Run INSERT IGNORE to copy the data from the original table into the new table. Duplicate rows are not inserted because they violate the UNIQUE constraint, and the violation is simply ignored.
- Rename the original table to something else and rename the new table to the original name.
- Drop the redundant table.
Hope this helps.
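A hedged sketch of those steps, reusing the question's table name (the new table name and the unique-key column list are assumptions; MySQL allows at most 16 columns in one index, so the full 24-column list from the question cannot all go into a single UNIQUE key, pick the subset that defines a duplicate for your data):

-- 1) New table with the same structure (a regular table, not TEMPORARY, so it can be renamed).
CREATE TABLE app_info_dedup LIKE app_info;

-- 2) Unique key over the columns that define a duplicate (assumed subset).
ALTER TABLE app_info_dedup
  ADD UNIQUE KEY uq_entry (`when`, `idbooth`, `iddongle`, `typeinfo`, `str2`);

-- 3) Copy the data; rows that collide on the unique key are silently skipped.
INSERT IGNORE INTO app_info_dedup
SELECT * FROM app_info;

-- 4) Swap the tables, then drop the old one once the new table has been verified.
RENAME TABLE app_info TO app_info_old,
             app_info_dedup TO app_info;
-- DROP TABLE app_info_old;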