Choosing sort and distkeys for redshift with multiple joins The 2019 Stack Overflow Developer...

How to create dashed lines/arrows in Illustrator

What is the steepest angle that a canal can be traversable without locks?

Where to refill my bottle in India?

Why is Grand Jury testimony secret?

How was Skylab's orbit inclination chosen?

How can I fix this gap between bookcases I made?

Protecting Dualbooting Windows from dangerous code (like rm -rf)

What is the motivation for a law requiring 2 parties to consent for recording a conversation

What could be the right powersource for 15 seconds lifespan disposable giant chainsaw?

Does a dangling wire really electrocute me if I'm standing in water?

What can other administrators access on my machine?

Can't find the latex code for the ⍎ (down tack jot) symbol

Patience, young "Padovan"

Could JWST stay at L2 "forever"?

What does Linus Torvalds mean when he says that Git "never ever" tracks a file?

Is flight data recorder erased after every flight?

I see my dog run

In microwave frequencies, do you use a circulator when you need a (near) perfect diode?

Why can Shazam do this?

Access elements in std::string where positon of string is greater than its size

JSON.serialize: is it possible to suppress null values of a map?

Why is my p-value correlated to difference between means in two sample tests?

Does duplicating a spell with Wish count as casting that spell?

How to change the limits of integration



Choosing sort and distkeys for redshift with multiple joins



The 2019 Stack Overflow Developer Survey Results Are InManaging code and deployments to Amazon RedshiftAmazon Redshift “cache lookup failed for aggregate” error when using median aggregateRedshift table not showing up in tables for schema?Issue using COPY command with AWS RedshiftStorage size for varchar length in RedshiftRedshift: InternalError: cache lookup failed for relationredshift drop table with + and spaces in nameRedshift window function for trend % calculationRedshift DROP and TRUNCATE hang with no locksBest Distkey and Sortkey for the tables in Redshift





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}







0















I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:



users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:



users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.



What is the optimal way to set the sort and dist keys on the accounts table in this situation?










share|improve this question














bumped to the homepage by Community 13 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
















  • If they are 1:1:1, why not denormalize them into one table?

    – CalZ
    May 9 '17 at 17:48











  • I would love to, but business restraints require that they stay that way.

    – user2694306
    May 9 '17 at 18:53











  • (Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

    – joanolo
    Jul 4 '17 at 23:13




















0















I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:



users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:



users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.



What is the optimal way to set the sort and dist keys on the accounts table in this situation?










share|improve this question














bumped to the homepage by Community 13 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
















  • If they are 1:1:1, why not denormalize them into one table?

    – CalZ
    May 9 '17 at 17:48











  • I would love to, but business restraints require that they stay that way.

    – user2694306
    May 9 '17 at 18:53











  • (Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

    – joanolo
    Jul 4 '17 at 23:13
















0












0








0








I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:



users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:



users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.



What is the optimal way to set the sort and dist keys on the accounts table in this situation?










share|improve this question














I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:



users -> accounts -> subscriptions. These are in a 1:1:1 relationship. The joins are as follows:



users.id = accounts.user_id
accounts.id = subscriptions.accounts_id.



What is the optimal way to set the sort and dist keys on the accounts table in this situation?







redshift






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked May 9 '17 at 17:43









user2694306user2694306

10117




10117





bumped to the homepage by Community 13 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







bumped to the homepage by Community 13 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • If they are 1:1:1, why not denormalize them into one table?

    – CalZ
    May 9 '17 at 17:48











  • I would love to, but business restraints require that they stay that way.

    – user2694306
    May 9 '17 at 18:53











  • (Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

    – joanolo
    Jul 4 '17 at 23:13





















  • If they are 1:1:1, why not denormalize them into one table?

    – CalZ
    May 9 '17 at 17:48











  • I would love to, but business restraints require that they stay that way.

    – user2694306
    May 9 '17 at 18:53











  • (Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

    – joanolo
    Jul 4 '17 at 23:13



















If they are 1:1:1, why not denormalize them into one table?

– CalZ
May 9 '17 at 17:48





If they are 1:1:1, why not denormalize them into one table?

– CalZ
May 9 '17 at 17:48













I would love to, but business restraints require that they stay that way.

– user2694306
May 9 '17 at 18:53





I would love to, but business restraints require that they stay that way.

– user2694306
May 9 '17 at 18:53













(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

– joanolo
Jul 4 '17 at 23:13







(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)

– joanolo
Jul 4 '17 at 23:13












1 Answer
1






active

oldest

votes


















0














The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:



FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id


The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.



As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins



If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.






share|improve this answer


























    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "182"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f173115%2fchoosing-sort-and-distkeys-for-redshift-with-multiple-joins%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:



    FROM subscriptions
    LEFT JOIN accounts
    ON accounts.id = subscriptions.account_id
    AND accounts.user_id = subscriptions.user_id --Note this add'l condition
    LEFT JOIN users
    ON users.id=accounts.user_id


    The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.



    As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins



    If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.






    share|improve this answer






























      0














      The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:



      FROM subscriptions
      LEFT JOIN accounts
      ON accounts.id = subscriptions.account_id
      AND accounts.user_id = subscriptions.user_id --Note this add'l condition
      LEFT JOIN users
      ON users.id=accounts.user_id


      The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.



      As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins



      If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.






      share|improve this answer




























        0












        0








        0







        The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:



        FROM subscriptions
        LEFT JOIN accounts
        ON accounts.id = subscriptions.account_id
        AND accounts.user_id = subscriptions.user_id --Note this add'l condition
        LEFT JOIN users
        ON users.id=accounts.user_id


        The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.



        As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins



        If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.






        share|improve this answer















        The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:



        FROM subscriptions
        LEFT JOIN accounts
        ON accounts.id = subscriptions.account_id
        AND accounts.user_id = subscriptions.user_id --Note this add'l condition
        LEFT JOIN users
        ON users.id=accounts.user_id


        The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.



        As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins



        If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jul 4 '17 at 23:12









        joanolo

        9,92842254




        9,92842254










        answered Jul 4 '17 at 18:28









        Fabio BeltraminiFabio Beltramini

        1716




        1716






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Database Administrators Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f173115%2fchoosing-sort-and-distkeys-for-redshift-with-multiple-joins%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            ORA-01691 (unable to extend lob segment) even though my tablespace has AUTOEXTEND onORA-01692: unable to...

            Always On Availability groups resolving state after failover - Remote harden of transaction...

            Circunscripción electoral de Guipúzcoa Referencias Menú de navegaciónLas claves del sistema electoral en...