
What is effective way to delete duplicate records based on multiple columns in MySQL?



I have a table named users with approximately 70 million records, of which I estimate about 10 million are duplicates. How can I remove those 10 million records while keeping one copy of each?

Two records are considered duplicates if they have the same email, alt_email, mobile and alt_mobile.

I searched on Stack Exchange but couldn't find any solution that works well with data this large. Any help would be appreciated. The table is in MySQL 5.5.45 with the InnoDB engine.

This is the users table. Apart from the columns shown there are a few more, but only these are relevant to the question.



CREATE TABLE users
(
id int(11) NOT NULL AUTO_INCREMENT,
email varchar(200) DEFAULT NULL,
alt_email varchar(200) DEFAULT NULL,
mobile varchar(100) DEFAULT NULL,
alt_mobile varchar(100) DEFAULT NULL,
PRIMARY KEY (id),
KEY mobile (mobile),
KEY email (email)
) ENGINE=InnoDB;









mysql innodb mysql-5.5 duplication






asked Jul 1 '16 at 15:31 by Kamal Nayan (edited Jul 1 '16 at 19:00)

2 Answers

Plan A:

ALTER IGNORE TABLE users
    ADD UNIQUE(email, alt_email, mobile, alt_mobile);

This will go through the table, copying it over and ignoring any dups. I am, however, hesitant to say that it will properly handle NULLs in those columns. (ALTER IGNORE was removed in MySQL 5.7, but it is still available in the 5.5 version you are running.)

Plan B:

CREATE TABLE new LIKE users;
ALTER TABLE new ADD UNIQUE(email, alt_email, mobile, alt_mobile);
INSERT INTO new
    SELECT * FROM users
    GROUP BY email, alt_email, mobile, alt_mobile;
-- check the contents of `new` to see if it worked correctly, especially for NULLs
RENAME TABLE users TO old, new TO users;
DROP TABLE old;

Plan C: Do some other query, but use <=> instead of = when comparing potentially NULL columns. See the MySQL documentation for the <=> (NULL-safe equal) operator.
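
A minimal sketch of Plan C (the keep-the-lowest-id rule is just an assumption; on 70M rows it would also need the composite index discussed in the other answer):

DELETE t2
FROM users t1
JOIN users t2
  ON  t1.email      <=> t2.email
  AND t1.alt_email  <=> t2.alt_email
  AND t1.mobile     <=> t2.mobile
  AND t1.alt_mobile <=> t2.alt_mobile
  AND t1.id < t2.id;   -- keeps the lowest id in each duplicate group, treating NULLs as equal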







answered Jul 10 '16 at 16:45 by Rick James

Consider enforcing this combination of columns as a unique constraint at the database level. That won't be possible until the existing duplicates are cleaned out. But it will help prevent this problem from creeping back into the data.
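
A minimal sketch of that constraint, assuming the duplicates have already been removed (the key name is made up, and the combined key length has to fit within InnoDB's index size limits for your character set):

ALTER TABLE users
    ADD UNIQUE KEY uq_users_contact (email, alt_email, mobile, alt_mobile);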



You should define the 4 columns as a single non-unique composite index now, since this process will involve lots of comparisons on them; list the columns in the same order you plan to reference them in your query. Querying 70M records is going to take a long time, but with an index on the columns in question it may drop to a feasible time frame. And you may end up needing to use FORCE INDEX anyway.

https://www.percona.com/blog/2014/01/03/multiple-column-index-vs-multiple-indexes-with-mysql-56/
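
A minimal sketch of that non-unique composite index (the index name is an assumption):

ALTER TABLE users
    ADD INDEX idx_users_dedup (email, alt_email, mobile, alt_mobile);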



Is there a preferred strategy for picking which version of the record to keep? Assuming you want to keep the most recent one on file, for example, you can use a date column (I'll call it cdate) or the value of an auto-increment column (I'll call it id).

You may want to create a new table users_test or something similar using an otherwise unchanged CREATE TABLE statement, then shunt a more manageable portion of your data over there to experiment on, especially if you already know a few of the duplicate records and can be sure to include them. Then you can start out with something like this:



SELECT
    t1.id,
    t2.id,
    t1.cdate,
    t2.cdate,
    t1.email,
    t2.email,
    t1.alt_email,
    t2.alt_email,
    t1.mobile,
    t2.mobile,
    t1.alt_mobile,
    t2.alt_mobile
    # Include whatever other columns may be helpful
FROM
    # Or users_test if it exists
    users t1
    JOIN users t2 ON (
        # We want t2 to always contain a duplicate but older record
        t1.cdate > t2.cdate
        AND t1.email = t2.email
        AND t1.alt_email = t2.alt_email
        AND t1.mobile = t2.mobile
        AND t1.alt_mobile = t2.alt_mobile
    );


Sanity check the results. You want to make damn sure your data matches your expectations before running a DELETE statement in production. But once you are happy with the results you can safely collect the t2.id values from that query. Those will all be records that have a newer duplicate row.

If you don't have a date to compare, use id. If you would rather keep the oldest record on file, switch the > to a <. Or you may have a completely different strategy in mind for picking the version of the record to keep; that's left as an exercise for the reader.

Adding an index on these columns will slow down writes. So if it is vitally important that writes to the users table in production stay super fast, you may not be able to define the index (unique or not) in production. But with this method you can still dump the whole thing to a staging environment, set up the index there, collect the IDs, and then use just the ID list in DELETE statements in production.

DELETE FROM users WHERE id IN (/* list of t2.id values here */);

Personally I find slightly slower writes a small price to pay for the potential read boost and data sanitization that come with defining natural keys as UNIQUE constraints. But I've also never dealt with a project with 70M rows in the users table, so my experience may not mean much for your particular situation.

If this proves unworkable, I can think of two other strategies worth investigating.

Use a cloned table similar to users_test discussed above. Move some much smaller subset of user records into it and join the two tables rather than self-joining users. Comparing 10k rows to 70M rows should be much faster than comparing 70M to 70M, but checking all the records will require running 7,000+ iterations. You'd want to use the full users table in the t2 position, and you could potentially pick up some optimizations depending on how you chunk your rows into the table for comparison.
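
A rough sketch of one such iteration, assuming rows are chunked by id range (the users_chunk table name and the 10,000-row chunk size are made up for illustration):

CREATE TABLE users_chunk LIKE users;

INSERT INTO users_chunk
SELECT * FROM users
WHERE id BETWEEN 1 AND 10000;   # advance this range on each iteration

SELECT t2.id
FROM users_chunk t1
JOIN users t2 ON (
    t1.cdate > t2.cdate
    AND t1.email = t2.email
    AND t1.alt_email = t2.alt_email
    AND t1.mobile = t2.mobile
    AND t1.alt_mobile = t2.alt_mobile
);

TRUNCATE TABLE users_chunk;   # empty it before loading the next range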



Or you could rename your existing users table to something like users_old. Then you can create a new users table with the unique constraint in place across those 4 columns, and use a combination of INSERT INTO … SELECT … and INSERT … ON DUPLICATE KEY UPDATE … statements to populate the new table with values from the old table.

INSERT INTO users
SELECT * FROM users_old
ON DUPLICATE KEY UPDATE
    some_column = VALUES(some_column),
    some_other_column = VALUES(some_other_column), …
;

You can leave the columns from the UNIQUE constraint out of your UPDATE clause because you know they wouldn't change anyway. And this strategy goes out the window if enforcing a UNIQUE constraint in production isn't an option for you.







answered Jul 1 '16 at 17:22 by dap6000 (edited by Paul White)