Choosing sort and distkeys for redshift with multiple joins The 2019 Stack Overflow Developer...
How to create dashed lines/arrows in Illustrator
What is the steepest angle that a canal can be traversable without locks?
Where to refill my bottle in India?
Why is Grand Jury testimony secret?
How was Skylab's orbit inclination chosen?
How can I fix this gap between bookcases I made?
Protecting Dualbooting Windows from dangerous code (like rm -rf)
What is the motivation for a law requiring 2 parties to consent for recording a conversation
What could be the right powersource for 15 seconds lifespan disposable giant chainsaw?
Does a dangling wire really electrocute me if I'm standing in water?
What can other administrators access on my machine?
Can't find the latex code for the ⍎ (down tack jot) symbol
Patience, young "Padovan"
Could JWST stay at L2 "forever"?
What does Linus Torvalds mean when he says that Git "never ever" tracks a file?
Is flight data recorder erased after every flight?
I see my dog run
In microwave frequencies, do you use a circulator when you need a (near) perfect diode?
Why can Shazam do this?
Access elements in std::string where positon of string is greater than its size
JSON.serialize: is it possible to suppress null values of a map?
Why is my p-value correlated to difference between means in two sample tests?
Does duplicating a spell with Wish count as casting that spell?
How to change the limits of integration
Choosing sort and distkeys for redshift with multiple joins
The 2019 Stack Overflow Developer Survey Results Are InManaging code and deployments to Amazon RedshiftAmazon Redshift “cache lookup failed for aggregate” error when using median aggregateRedshift table not showing up in tables for schema?Issue using COPY command with AWS RedshiftStorage size for varchar length in RedshiftRedshift: InternalError: cache lookup failed for relationredshift drop table with + and spaces in nameRedshift window function for trend % calculationRedshift DROP and TRUNCATE hang with no locksBest Distkey and Sortkey for the tables in Redshift
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}
I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:
users
-> accounts
-> subscriptions
. These are in a 1:1:1 relationship. The joins are as follows:
users.id
= accounts.user_id
accounts.id
= subscriptions.accounts_id
.
What is the optimal way to set the sort and dist keys on the accounts
table in this situation?
redshift
bumped to the homepage by Community♦ 13 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:
users
-> accounts
-> subscriptions
. These are in a 1:1:1 relationship. The joins are as follows:
users.id
= accounts.user_id
accounts.id
= subscriptions.accounts_id
.
What is the optimal way to set the sort and dist keys on the accounts
table in this situation?
redshift
bumped to the homepage by Community♦ 13 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13
add a comment |
I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:
users
-> accounts
-> subscriptions
. These are in a 1:1:1 relationship. The joins are as follows:
users.id
= accounts.user_id
accounts.id
= subscriptions.accounts_id
.
What is the optimal way to set the sort and dist keys on the accounts
table in this situation?
redshift
I have a relatively large Redshift cluster which I am attempting to optimize via the selection of sort and dist keys. I have a recurring situation where multiple tables are joined together via an intermediary join. The join is relatively stable among all queries. I know that you're supposed to distribute on the field that you join to and sort on the conditions in the where clause. However, in this situation there is no where clause and it is joined twice. For example I have three tables:
users
-> accounts
-> subscriptions
. These are in a 1:1:1 relationship. The joins are as follows:
users.id
= accounts.user_id
accounts.id
= subscriptions.accounts_id
.
What is the optimal way to set the sort and dist keys on the accounts
table in this situation?
redshift
redshift
asked May 9 '17 at 17:43
user2694306user2694306
10117
10117
bumped to the homepage by Community♦ 13 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ 13 mins ago
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13
add a comment |
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13
add a comment |
1 Answer
1
active
oldest
votes
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f173115%2fchoosing-sort-and-distkeys-for-redshift-with-multiple-joins%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.
add a comment |
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.
add a comment |
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.
The ideal situation is if you can have a user_id column added onto subscriptions, then you can distribute all three by user_id. Then your joins would be written as:
FROM subscriptions
LEFT JOIN accounts
ON accounts.id = subscriptions.account_id
AND accounts.user_id = subscriptions.user_id --Note this add'l condition
LEFT JOIN users
ON users.id=accounts.user_id
The one caveat here is that your number of users should be large enough to ensure that this distribution key does not result in significant skew.
As a bonus, if you do not frequently filter these, you can choose the same column (user_id) as your sort key to enable merge joins rather than hash joins
If you can't add this column, then you are going to have network distribution happening in one of your two joins :-( Pick the one that you want based on frequency of that join and average width of rows participating in that join.
edited Jul 4 '17 at 23:12
joanolo
9,92842254
9,92842254
answered Jul 4 '17 at 18:28
Fabio BeltraminiFabio Beltramini
1716
1716
add a comment |
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f173115%2fchoosing-sort-and-distkeys-for-redshift-with-multiple-joins%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
If they are 1:1:1, why not denormalize them into one table?
– CalZ
May 9 '17 at 17:48
I would love to, but business restraints require that they stay that way.
– user2694306
May 9 '17 at 18:53
(Sometimes is a matter of privileges: Who can see what. It's normally easier to handle permissions per table than to do it per column)
– joanolo
Jul 4 '17 at 23:13