AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418)

In this session, we take an in-depth look at best practices for data warehousing with Amazon Redshift. We show you the best way to analyze all your data, inside and outside your data warehouse, without moving the data, which helps you gain deeper insights for running your business. We also cover best practices for how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
Comments

this is pretty crisp and to the point, worth your time.

aksharjamgaonkar

My notes:
19:34:
## Data distribution strategies, i.e. how data is distributed across slices (a slice is a virtual node; one physical node is made up of 2 to 16 slices, which act as sharding units). ##
Note that data in a table is stored as follows: it is distributed according to the table's data-distribution strategy (e.g. according to its distribution key, if you chose KEY-based distribution), and then sorted within each partition according to the sort key(s) (the sort key is what the zone maps are built on).

Data distribution has three goals:
1) *Avoid "hot partitions/shards"*, i.e. a partition of data where requests are concentrated, putting a lot of strain on the resources underlying the slice/node that hosts it while the other slices in the cluster sit under-utilized.
2) *Skip IO*: with a query like "SELECT * FROM my_table WHERE country='FRANCE';", if this table is distributed on the "country" column, then all the relevant data for this query is stored on the same partition and can be returned quickly; the compute nodes can quickly identify the partitions where the data is and is not. (Note: "country" is probably not a good choice of partition key, as the resulting partitions will likely be too big, and we'll end up with a hot partition.)
3) *Return relevant data together in few IO operations*: this is the same idea as 2): when all the relevant data fills a whole partition (or several partitions whose data is all relevant), then every data block of the partition is relevant and the storage layer can read them in just a few IO operations. For this reason, it makes sense to use as partition key a column that is often used as a joining criterion (e.g. "JOIN mytable1 and mytable2 ON columnToUseAsPartitionKey"); see the co-located join sketch just below.
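A minimal sketch of goal 3, with hypothetical table and column names (mytable1, mytable2 and join_col are made up), assuming standard Redshift CREATE TABLE syntax:

-- Distributing both sides of a frequent join on the join column keeps
-- matching rows on the same slice.
CREATE TABLE mytable1 (
    join_col   BIGINT,
    fact_value DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (join_col);

CREATE TABLE mytable2 (
    join_col  BIGINT,
    attribute VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (join_col);

-- With a shared distribution key, the join can run slice-by-slice without
-- redistributing rows between nodes (EXPLAIN should show DS_DIST_NONE).
EXPLAIN
SELECT t2.attribute, SUM(t1.fact_value)
FROM mytable1 t1
JOIN mytable2 t2 ON t1.join_col = t2.join_col
GROUP BY t2.attribute;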

There are 4 data distribution options for a table (a DDL sketch follows this list):
- *hash partitioning/sharding* (aka distribution based on a *key*): you choose a column (aka key) of your table to be used as the partition key. All the rows that share the same value for this column are stored together in the same partition/shard. To get a good data distribution and avoid hot partitions, use a column with an even value distribution (to get even-sized partitions), with high cardinality (to avoid huge partitions that can't be broken down into smaller pieces), and on which your queries aren't focused on a single value (that's another reason why "country" is a bad partition key: you'll typically query facts about a single country, and therefore put all the strain on the partition corresponding to that country, making it a hot partition; for the same reason, don't use a date as partition key, because you'll typically query facts for a specific date, making the partition corresponding to that date hot).
With hash partitioning, if you choose your distribution key well, you avoid hot partitions (goal 1), skip IO (goal 2), and read your data in fewer IO operations (goal 3).
- *Even distribution*: the rows of your table are spread evenly among your slices, row by row in a round-robin fashion. The idea is to avoid hot partitions (goal 1), but you don't get goals 2 and 3.
- *ALL*: your table is duplicated on the first slice of each of your compute nodes. This only makes sense for "small" tables (i.e. < 3M rows). Here there's actually no partitioning, so no goal 1, 2 or 3; but as your data is duplicated, it can be fetched from any of the copies, which avoids having a hot slice: if one slice is under strain, the request can go to another one holding a copy of that data.
- *AUTO*: at first data is stored in an ALL fashion, and when the table becomes too big, Redshift automatically redistributes it in an EVEN fashion.
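For reference, a hedged DDL sketch of the four distribution styles, using made-up table names (sales_by_key, sales_even, country_dim, staging_auto):

-- Hash/key distribution: rows with the same customer_id land on the same slice.
CREATE TABLE sales_by_key (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- Even distribution: rows are spread round-robin across slices.
CREATE TABLE sales_even (
    sale_id BIGINT,
    amount  DECIMAL(12,2)
)
DISTSTYLE EVEN;

-- ALL distribution: a full copy of the (small) table on each compute node.
CREATE TABLE country_dim (
    country_code CHAR(2),
    country_name VARCHAR(64)
)
DISTSTYLE ALL;

-- AUTO: Redshift starts the table small (ALL) and redistributes it as it grows.
CREATE TABLE staging_auto (
    id      BIGINT,
    payload VARCHAR(256)
)
DISTSTYLE AUTO;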

Note that in all cases, your data is always stored in two copies. So I guess that with the ALL distribution, your data ends up with 2 x <the number of nodes in your cluster> copies.
[I am unsure about what follows] Also, note that even though writes to a table (and therefore to the target partition) happen row by row, sorted according to the sort key(s), the columns are stored together and compressed, apparently per 1 MB chunk (somewhat like the Apache ORC format or Oracle HCC?), and the 1 MB blocks that make up each chunk are immutable, with a zone map kept per 1 MB block. So if a table has 10 columns, adding a row to that table would mean that each of the 10 column blocks for the affected chunk has to be decompressed, have the new value appended, and then be re-compressed. [Actually the 1 MB blocks are immutable (see 30:02), so a whole new 1 MB block has to be created.] So it's better to perform inserts in batches, to avoid doing such heavy operations for just one added row at a time. Is Redshift therefore unsuited to real-time analytics? I don't think so: the data model is typically historical, append-only data, so inserts are appended to the table within the same set of rows. Sure, adding new rows one by one is a heavier process than in a row-oriented DB, but it's heavy work happening "at the margin", affecting just one chunk of rows per table at a time.
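A small illustration of the batching point, reusing the hypothetical sales_by_key table from the DDL sketch above; single-row inserts rewrite a whole set of column blocks each time, so batched inserts (or, better still, a bulk COPY from S3) are preferred:

-- Anti-pattern: many single-row inserts, each touching every column's blocks.
INSERT INTO sales_by_key (sale_id, customer_id, amount) VALUES (1, 42, 9.99);

-- Better: a single multi-row insert covering many rows at once.
INSERT INTO sales_by_key (sale_id, customer_id, amount)
VALUES (1, 42, 9.99), (2, 43, 19.99), (3, 44, 4.50);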

25:00 data is always at least twice redundant in the cluster. Redshift uses 2-phase commit for transactions.

galeop

13:40 So zone maps are like "storage indexes" in Oracle Exadata: they let you skip unnecessary IO by recording the min and max values of each compressed 1 MB storage block, which tells the engine whether the values it is looking for can NOT be in that block. Sort keys in Redshift determine how your table is sorted on disk, and therefore shape the zone maps that track the min/max of the sort-key columns in each 1 MB block. You'd typically set sort keys on frequently filtered columns, and have fewer than 4 sort keys per table (the table being sorted by column X, then by column Y where column X has identical values, etc.). Prefer low-cardinality columns for sort keys, as they make zone maps more meaningful (you know where the filtered data is not).
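A hedged sketch of a sort key on frequently filtered columns, using a made-up events table; the filter on the leading sort column is what lets zone maps skip blocks:

-- Compound sort key on the columns most often used in WHERE clauses.
CREATE TABLE events (
    event_date DATE,
    country    VARCHAR(64),
    event_type VARCHAR(32),
    payload    VARCHAR(512)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (event_date, event_type);

-- Zone maps store the min/max of each 1 MB block; a range filter on the
-- leading sort column lets Redshift skip blocks whose range cannot match.
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2019-12-01' AND '2019-12-07';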

galeop

Wow! An intelligent and insightful deep dive into the hidden pearls of Redshift! Love the topic of Advisor! Thanks Tony & Harshida!

raghukundurthi

16:34 Like all data warehouses, you should use a denormalized data model, something akin to a star schema or snowflake schema. In your star schema, the Fact table is a denormalized table where entities that belong to the same fact have been pre-joined ("materialized", i.e. they are stored joined together on the Fact table). You will also store pre-calculated aggregations in the Fact table (as you store those values on the table, they are "materialized").
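A rough sketch of such a denormalized fact table (all names are made up); dimension attributes and a pre-calculated aggregate are materialized directly on the fact rows:

CREATE TABLE fact_daily_sales (
    sale_date        DATE,
    customer_id      BIGINT,
    customer_segment VARCHAR(32),   -- denormalized from the customer dimension
    product_category VARCHAR(32),   -- denormalized from the product dimension
    daily_revenue    DECIMAL(14,2), -- pre-calculated aggregation
    daily_units      INTEGER
)
DISTKEY (customer_id)
SORTKEY (sale_date);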

galeop

Very vast for a beginner like me. But answered a lot of questions

RPRAVEENKUMAR

I have two questions:
1. From a performance point of view, is there any major difference between a temp table and CTAS with BACKUP NO?
2. For the ALTER TABLE APPEND command, what if the source data types are the same but the data lengths don't match? Will it still work?
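For reference, a hedged sketch of the two commands being asked about, with made-up table names (this does not answer the performance or length-mismatch questions):

-- 1) A temp table vs. a CTAS that skips the automated snapshot backup.
CREATE TEMP TABLE stage_tmp AS
SELECT * FROM sales_by_key WHERE amount > 100;

CREATE TABLE stage_ctas BACKUP NO AS
SELECT * FROM sales_by_key WHERE amount > 100;

-- 2) ALTER TABLE APPEND moves rows from a staging table into a target table.
ALTER TABLE sales_by_key APPEND FROM stage_ctas;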

shrabanti

What is the best way to use an UPDATE statement when I am getting a slow response?

yasink

27:51 According to which criteria should the big S3 file to COPY be split into 16 smaller files? Should we partition it according to the same distribution method as the one used in the table (e.g. the same hash key)?
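For reference, a hedged sketch of a parallel COPY from split files sharing a key prefix (the bucket name and IAM role ARN are placeholders). As far as I understand, the files just need to be roughly equal-sized and ideally a multiple of the slice count, since Redshift redistributes the rows according to the table's distribution style as it loads them:

-- COPY loads all objects sharing the key prefix in parallel across slices.
COPY sales_by_key
FROM 's3://my-example-bucket/sales/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';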

galeop

How much concurrency and CPU usage should you expect if you are running a decent-size query that takes about 4-5 hours and joins 2-3 tables? It runs early in the morning.

Thevisionaryaddy

43:40 for WLM and queues, what are these percentages of memory about? Leader node's memory, compute node's memory, or both?
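Not an answer, but one hedged way to inspect the configured queues is the documented WLM system table (as I recall, user-defined queues are service classes 6 and above):

-- Shows the WLM queue configuration, including per-queue slot count and memory.
SELECT *
FROM stv_wlm_service_class_config
WHERE service_class >= 6;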

galeop

55:01 I suppose that altering the distribution key of a table means that all of the table's partitions will be rewritten from scratch according to the new distribution key, so it's a very heavy operation. Correct?
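For reference, a hedged sketch of the in-place commands for changing distribution (table and column names are made up); as I understand it, Redshift then redistributes the table's data in the background:

-- Change the distribution key of an already KEY-distributed table.
ALTER TABLE sales_by_key ALTER DISTKEY sale_id;

-- Change the distribution style (and key) of an EVEN-distributed table.
ALTER TABLE sales_even ALTER DISTSTYLE KEY DISTKEY sale_id;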

galeop

I suppose Advisor is not available in ap-south-1. I wonder why AWS is selective when releasing these kinds of add-on services.

dsapraveen