Postgres is performing sequential scan instead of index scan
I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan, and what can I do to avoid it?
From other answers I suspect this is as much related to the query as to the index.
explain select "labelDate" from pages group by "labelDate";
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=524616.78..524617.04 rows=26 width=4)
Group Key: "labelDate"
-> Seq Scan on pages (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)
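As an aside, the DISTINCT spelling of the same query is planned essentially the same way here, so merely switching between GROUP BY and DISTINCT does not avoid the seq scan (a quick check against the same table):

```sql
-- Planned like the GROUP BY version: HashAggregate over a Seq Scan.
EXPLAIN SELECT DISTINCT "labelDate" FROM pages;
```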
Table structure:
http=# \d pages
Table "public.pages"
Column | Type | Modifiers
-----------------+------------------------+----------------------------------
pageid | integer | not null default nextval('...
createDate | integer | not null
archive | character varying(16) | not null
label | character varying(32) | not null
wptid | character varying(64) | not null
wptrun | integer | not null
url | text |
urlShort | character varying(255) |
startedDateTime | integer |
renderStart | integer |
onContentLoaded | integer |
onLoad | integer |
PageSpeed | integer |
rank | integer |
reqTotal | integer | not null
reqHTML | integer | not null
reqJS | integer | not null
reqCSS | integer | not null
reqImg | integer | not null
reqFlash | integer | not null
reqJSON | integer | not null
reqOther | integer | not null
bytesTotal | integer | not null
bytesHTML | integer | not null
bytesJS | integer | not null
bytesCSS | integer | not null
bytesImg | integer | not null
bytesFlash | integer | not null
bytesJSON | integer | not null
bytesOther | integer | not null
numDomains | integer | not null
labelDate | date |
TTFB | integer |
reqGIF | smallint | not null
reqJPG | smallint | not null
reqPNG | smallint | not null
reqFont | smallint | not null
bytesGIF | integer | not null
bytesJPG | integer | not null
bytesPNG | integer | not null
bytesFont | integer | not null
maxageMore | smallint | not null
maxage365 | smallint | not null
maxage30 | smallint | not null
maxage1 | smallint | not null
maxage0 | smallint | not null
maxageNull | smallint | not null
numDomElements | integer | not null
numCompressed | smallint | not null
numHTTPS | smallint | not null
numGlibs | smallint | not null
numErrors | smallint | not null
numRedirects | smallint | not null
maxDomainReqs | smallint | not null
bytesHTMLDoc | integer | not null
fullyLoaded | integer |
cdn | character varying(64) |
SpeedIndex | integer |
visualComplete | integer |
gzipTotal | integer | not null
gzipSavings | integer | not null
siteid | numeric |
Indexes:
"pages_pkey" PRIMARY KEY, btree (pageid)
"pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
"idx_pages_cdn" btree (cdn)
"idx_pages_labeldate" btree ("labelDate") CLUSTER
"idx_pages_urlshort" btree ("urlShort")
Triggers:
pages_label_date BEFORE INSERT OR UPDATE ON pages
FOR EACH ROW EXECUTE PROCEDURE fix_label_date()
Tags: postgresql, index, query-performance, postgresql-9.4
asked Jun 30 '15 at 12:44 by Charlie Clark
edited Jul 2 '15 at 0:01 by Erwin Brandstetter
3 Answers
This is a known limitation of the Postgres optimizer. When there are only a few distinct values, as in your case, and you are on version 8.4 or later, a very fast workaround using a recursive query is described on the Postgres wiki: Loose Indexscan.
Your query could be rewritten as follows (the LATERAL join needs version 9.3+):
WITH RECURSIVE pa AS
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 )
  UNION ALL
  SELECT n."labelDate"
  FROM pa AS p
     , LATERAL
       ( SELECT "labelDate"          -- next distinct value after p."labelDate"
         FROM pages
         WHERE "labelDate" > p."labelDate"
         ORDER BY "labelDate"
         LIMIT 1
       ) AS n
)
SELECT "labelDate"
FROM pa;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
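To confirm that the rewrite avoids the sequential scan, prepend EXPLAIN; with the index in place, each LIMIT 1 subquery should be satisfied by a (possibly index-only) scan on idx_pages_labeldate rather than a Seq Scan. Exact plan output depends on your statistics:

```sql
EXPLAIN
WITH RECURSIVE pa AS
( ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 )
  UNION ALL
  SELECT n."labelDate"
  FROM pa AS p
     , LATERAL
       ( SELECT "labelDate"
         FROM pages
         WHERE "labelDate" > p."labelDate"
         ORDER BY "labelDate"
         LIMIT 1
       ) AS n
)
SELECT "labelDate" FROM pa;
```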
The best query depends very much on data distribution.
It's been established that you have many rows per date. Since your case boils down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid at all (as you commented).
Index
All you need is a simple btree index on "labelDate".
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. It would avoid needless updates to the index (and the resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp in generate_series()? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first leg of the CTE p is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows that would result in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
From the PostgreSQL documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on "labelDate" is a b-tree.
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
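For reference, re-clustering on the date index would look like this; a sketch of hypothetical maintenance commands (note that CLUSTER takes an exclusive lock on the table while it runs, and a follow-up ANALYZE refreshes planner statistics):

```sql
CLUSTER pages USING idx_pages_labeldate;  -- rewrite the table in index order
ANALYZE pages;                            -- refresh statistics for the planner
```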
Even a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01'` still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
I'm clustering at the moment (though the data was entered roughly in that order). That still doesn't really explain the query planner's decision not to use an index even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you also tried disabling sequential scans for the session? set enable_seqscan=off
In any case the documentation is clear: if you cluster, it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan, but it didn't make much difference. The speed of this query isn't actually crucial, as I use it to create a lookup table which can then be used for JOINs in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
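Since (per the comments) the result only feeds a lookup table used for joins, another option from 9.3 on is to persist it as a materialized view and refresh it after each import. A sketch, with label_dates as an assumed name:

```sql
CREATE MATERIALIZED VIEW label_dates AS
SELECT DISTINCT "labelDate"
FROM   pages
WHERE  "labelDate" IS NOT NULL;

-- After each bulk import:
REFRESH MATERIALIZED VIEW label_dates;
```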
answered Jun 30 '15 at 13:42 by ypercubeᵀᴹ (edited May 23 '17 at 12:40 by Community♦)
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp
in generate_series()
? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets a bit cheaper, yet. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE
p
is effectively the same as
SELECT min("labelDate") FROM pages
But the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires to exclude rows resulting in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
add a comment |
The best query very much depends on data distribution.
You have many rows per date, that's been established. Since your case burns down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate"
.
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON big ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediary states of rows with NULL values. Would avoid needless updates to the index (with resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM (
SELECT generate_series(min("labelDate")::timestamp
, max("labelDate")::timestamp
, interval '1 day')::date AS "labelDate"
FROM pages
) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp in generate_series()? See:
- Generating time series between two dates in PostgreSQL
Min and max can be picked from the index cheaply. If you know the minimum and / or maximum possible date, it gets cheaper still. Example:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM (SELECT date '2011-01-01' + g AS "labelDate"
FROM generate_series(0, 363) g) d
WHERE EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
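The probing strategy behind these queries can be simulated outside the database. The sketch below (plain Python, with a set standing in for the index probe; the sample dates are made up for illustration) enumerates every candidate day between min and max and keeps only the days that actually occur, which stays cheap whenever the gaps are few:

```python
from datetime import date, timedelta

def dates_in_range(present_dates):
    """Emulate the generate_series + EXISTS query: enumerate every
    calendar day between min and max, keeping only days that occur."""
    present = set(present_dates)      # stands in for the index probe
    lo, hi = min(present), max(present)
    day = timedelta(days=1)
    out, d = [], lo
    while d <= hi:
        if d in present:              # the EXISTS(...) probe
            out.append(d)
        d += day
    return out

sample = [date(2015, 6, 1), date(2015, 6, 1), date(2015, 6, 3)]
print(dates_in_range(sample))
# -> [datetime.date(2015, 6, 1), datetime.date(2015, 6, 3)]
```

Note the cost model: work is proportional to the width of the date range, not to the number of rows, which is exactly why this variant wins only when the range is dense.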
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
( -- parentheses required for LIMIT
SELECT "labelDate"
FROM pages
WHERE "labelDate" IS NOT NULL
ORDER BY "labelDate"
LIMIT 1
)
UNION ALL
SELECT (SELECT "labelDate"
FROM pages
WHERE "labelDate" > p."labelDate"
ORDER BY "labelDate"
LIMIT 1)
FROM p
WHERE "labelDate" IS NOT NULL
)
SELECT "labelDate"
FROM p
WHERE "labelDate" IS NOT NULL;
The first CTE p is effectively the same as
SELECT min("labelDate") FROM pages
but the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows resulting in NULL for "labelDate". See:
- Optimize GROUP BY query to retrieve latest record per user
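The loose index scan itself is easy to model: a sorted array plays the role of the B-tree, and each recursive step is one binary search for the first entry greater than the last date found. A minimal sketch (plain Python, bisect standing in for the index descent; sample data is made up):

```python
from bisect import bisect_right
from datetime import date

def loose_index_scan(sorted_dates):
    """Emulate the recursive CTE: take the smallest date, then
    repeatedly jump to the first date greater than the last one
    found -- one index descent per distinct value."""
    out = []
    i, n = 0, len(sorted_dates)
    while i < n:
        out.append(sorted_dates[i])
        # the correlated subquery: first entry > current date
        i = bisect_right(sorted_dates, sorted_dates[i], i)
    return out

days = sorted([date(2015, 6, 3), date(2015, 6, 1), date(2015, 6, 1)])
print(loose_index_scan(days))
# -> [datetime.date(2015, 6, 1), datetime.date(2015, 6, 3)]
```

This makes the complexity visible: roughly O(k log n) for k distinct values in n rows, which is why the technique shines when there are many rows per date.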
Asides
Unquoted, legal, lower case identifiers make your life easier.
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
answered Jul 1 '15 at 23:56 by Erwin Brandstetter
From the PostgreSQL documentation:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
Your index on "labelDate" is a B-tree.
Reference:
http://www.postgresql.org/docs/9.1/static/sql-cluster.html
Even a condition such as `WHERE "labelDate" BETWEEN '2000-01-01' AND '2020-01-01'` still involves a sequential scan.
– Charlie Clark
Jun 30 '15 at 12:58
Clustering at the moment (though the data was entered roughly in that order). That still doesn't really explain the query planner's decision not to use an index, even with a WHERE clause.
– Charlie Clark
Jun 30 '15 at 13:16
Have you also tried disabling sequential scans for the session? set enable_seqscan=off
In any case, the documentation is clear: if you cluster, it will perform a sequential scan.
– Fabrizio Mazzoni
Jun 30 '15 at 13:20
Yes, I tried disabling the sequential scan, but it didn't make much difference. The speed of this query isn't actually crucial, as I use it to create a lookup table which can then be used for JOINs in real queries.
– Charlie Clark
Jun 30 '15 at 13:23
answered Jun 30 '15 at 12:55 (edited Jun 30 '15 at 13:07) by Fabrizio Mazzoni