
Exploring DynamoDB

Read the basics of DynamoDB:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html

Working with Tables

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithDDTables.html

Points to be Noted

Primary Key:

1) Hash Key

2) Hash and Range key

Throughput Requirements:

1) Read Capacity Units – Number of strongly consistent reads per second, for items up to 4 KB in size

2) Write Capacity Units – Number of writes per second, for items up to 1 KB in size (a worked sizing example follows)
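As a worked sizing example (the item size and request rates below are assumed figures, not from a real workload): a strongly consistent read consumes one unit per 4 KB of item size, rounded up, and a write consumes one unit per 1 KB, rounded up.

// Rough capacity-unit estimate in Java (assumed workload figures, for illustration only).
public class CapacityEstimate {
    public static void main(String[] args) {
        double itemSizeKb = 6.0;      // assumed average item size
        int readsPerSecond = 50;      // assumed strongly consistent reads per second
        int writesPerSecond = 20;     // assumed writes per second

        // One read capacity unit covers a strongly consistent read of up to 4 KB.
        long readUnits = (long) Math.ceil(itemSizeKb / 4.0) * readsPerSecond;
        // One write capacity unit covers a write of up to 1 KB.
        long writeUnits = (long) Math.ceil(itemSizeKb / 1.0) * writesPerSecond;

        // 6 KB items -> 2 units per read, 6 units per write:
        //   50 reads/sec  -> 100 read capacity units
        //   20 writes/sec -> 120 write capacity units
        // Eventually consistent reads would need roughly half the read units.
        System.out.println("Read capacity units:  " + readUnits);
        System.out.println("Write capacity units: " + writeUnits);
    }
}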

Throughput Decision making:

Consider the following while deciding on throughput needs:

1) Item Size

  • Capacity units are based on a data item size of 4 KB per read or 1 KB per write

2) Expected Read and Write request

  • Expected number of read and write operations your application will perform against the table, per second

3) Consistency

  • Read capacity units are based on strongly consistent read operations, which require more effort and consume twice as many database resources as eventually consistent reads
  • By default, Amazon DynamoDB read operations are eventually consistent

4) Local Secondary Index

  • If you define one or more local secondary indexes on a table, you must do so at table creation time (see the table-creation sketch after this list)
  • Queries against indexes consume provisioned read throughput
  • Capacity units consumed by index operations are charged against the table’s provisioned throughput
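A minimal sketch of creating a table with a hash-and-range primary key and a local secondary index using the AWS SDK for Java (v1); the table name, attribute names, index name, credentials, and throughput figures here are assumptions for illustration, not values from this post's tables.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateTableSketch {
    public static void main(String[] args) {
        // Credentials and endpoint are placeholders.
        AmazonDynamoDBClient client =
                new AmazonDynamoDBClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        client.setEndpoint("https://dynamodb.us-east-1.amazonaws.com");

        // Hash-and-range primary key: userId (hash) + orderDate (range).
        // The local secondary index on orderTotal must be declared here, at creation time.
        CreateTableRequest request = new CreateTableRequest()
            .withTableName("SampleOrders")
            .withAttributeDefinitions(
                new AttributeDefinition("userId", ScalarAttributeType.S),
                new AttributeDefinition("orderDate", ScalarAttributeType.S),
                new AttributeDefinition("orderTotal", ScalarAttributeType.N))
            .withKeySchema(
                new KeySchemaElement("userId", KeyType.HASH),
                new KeySchemaElement("orderDate", KeyType.RANGE))
            .withLocalSecondaryIndexes(new LocalSecondaryIndex()
                .withIndexName("orderTotal-index")
                .withKeySchema(
                    new KeySchemaElement("userId", KeyType.HASH),
                    new KeySchemaElement("orderTotal", KeyType.RANGE))
                .withProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY)))
            // Queries against the index draw from this same provisioned throughput.
            .withProvisionedThroughput(new ProvisionedThroughput(10L, 5L));

        client.createTable(request);
    }
}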

Best Practices for working with Tables:

Uniform Data Access Across Items:

  • Provisioned throughput is dependent on the primary key selection, and the workload patterns on individual items
  • Amazon DynamoDB divides a table’s items into multiple partitions and distributes the data primarily based on the hash key element
  • Provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions
  • Total provisioned throughput/partitions = throughput per partition
  • To achieve the full amount of provisioned throughput for a table, keep the workload spread evenly across the hash key values (a worked example follows this list)
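As a worked example (the numbers are assumed): if a table is provisioned for 10,000 read capacity units and DynamoDB has internally split it into 4 partitions, each partition can serve at most 10,000 / 4 = 2,500 read units per second. A workload concentrated on a single hash key is therefore capped at roughly 2,500 units per second, even though the table as a whole is provisioned for 10,000.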

Distribute Write Activity during Data Upload:

  • Amazon DynamoDB partitions your table data on multiple servers
  • Spread the upload across hash key values so that writes reach all allocated servers simultaneously, which gives better performance

Understand Access Patterns for Time Series Data:

  • Instead of storing all items in a single table, try using multiple tables to store these items
  • Create tables to store monthly or weekly data
  • The access rate is highest on the latest month's or week's table, so that table needs the highest provisioned throughput (a hypothetical scheme is sketched below)
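As a hypothetical illustration (table names and numbers are assumptions, not from this post): a current-month table such as Events_2013_09 might be provisioned at, say, 10,000 read units, while last month's Events_2013_08 is dropped to a few hundred units, and still older data is exported to Amazon S3.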

AWS Eclipse Setup

Set up the IDE

Eclipse – http://aws.amazon.com/eclipse/ – install the needed toolkits, e.g. DynamoDB, EC2.

Set up the AWS SDK in Eclipse.

Configure AWS access with the access key, secret key, and the PEM file.

Create an AWS Java project with the DynamoDB sample code.

Run it to check that the code can connect to DynamoDB and create, insert into, and scan a table.

Duplicate the code and customize it for a new table and structure; the essential calls look roughly like the sketch below.
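A minimal sketch of those calls with the AWS SDK for Java (v1); the credentials, endpoint, table name, and attribute names are placeholders rather than values from the sample project.

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.*;

public class DynamoDbSmokeTest {
    public static void main(String[] args) {
        // Connect (placeholder credentials and region endpoint).
        AmazonDynamoDBClient client =
                new AmazonDynamoDBClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        client.setEndpoint("https://dynamodb.us-east-1.amazonaws.com");

        // Insert a single item into a hypothetical table keyed on "id".
        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("id", new AttributeValue("item-001"));
        item.put("payload", new AttributeValue("hello dynamo"));
        client.putItem(new PutItemRequest("SampleTable", item));

        // Scan the table back to confirm the write landed.
        ScanResult result = client.scan(new ScanRequest("SampleTable"));
        System.out.println("Items in SampleTable: " + result.getCount());
    }
}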

Explore DynamoDB and Redshift Integration:

TBD

Limitations / Pinch Points in DynamoDB:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html

Points to be noted:

  1. Provisioned throughput limits
    • US East (Northern Virginia) Region:
      • Per table – 40,000 read capacity units or 40,000 write capacity units
      • Per account – 80,000 read capacity units or 80,000 write capacity units
  2. Item size
    • Cannot exceed 64 KB, including both attribute name lengths (UTF-8 binary length) and attribute value lengths (also binary length)
  3. Attribute Value
    • Attribute values cannot be null or empty
  4. Query
    • Result set limited to 1 MB per API call. Use the LastEvaluatedKey from the query response to retrieve more results.
  5. Scan
    • Scanned data set size maximum is 1 MB per API call. Use the LastEvaluatedKey from the scan response to retrieve more results (a pagination sketch follows this list).
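A minimal pagination sketch, reusing the hypothetical client and SampleTable from the earlier snippet: each Scan (and likewise Query) response is capped at 1 MB, so the loop feeds LastEvaluatedKey back in until it comes back null.

// Page through a full scan, 1 MB at a time (table name is a placeholder).
Map<String, AttributeValue> lastKey = null;
do {
    ScanRequest scan = new ScanRequest("SampleTable")
            .withExclusiveStartKey(lastKey);   // null on the first page
    ScanResult page = client.scan(scan);

    for (Map<String, AttributeValue> item : page.getItems()) {
        // process each item here
    }

    lastKey = page.getLastEvaluatedKey();      // null once the scan is complete
} while (lastKey != null);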

Access Control:

  • Amazon DynamoDB integrates with AWS Identity and Access Management (IAM)
  • IAM lets you give other users of your AWS Account access to Amazon DynamoDB tables within the AWS Account
  • IAM can’t give another AWS Account (or Users in another AWS Account) access to the AWS Account’s Amazon DynamoDB tables
  • Amazon DynamoDB supports IAM policies using per-table ARNs, which require the account ID (a sample policy statement follows this list)
    • “Resource”: “arn:aws:dynamodb:<region>:<accountID>:table/<tablename>”
  • When using the Query API on a local secondary index (by passing the IndexName parameter), the ARN includes the index name in the following format
    • “Resource”: “arn:aws:dynamodb:<region>:<accountID>:table/<tablename>/index/<indexname>”
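As an illustration of how that ARN slots into a policy document (the actions and region chosen here are assumptions for the example; the account ID and table name placeholders are as above):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan"],
      "Resource": "arn:aws:dynamodb:us-east-1:<accountID>:table/<tablename>"
    }
  ]
}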

Querying DynamoDB with EMR:

An EMR cluster is launched for each of the following activities with DynamoDB:

  • Exporting data stored in Amazon DynamoDB to Amazon S3.
  • Importing data in Amazon S3 to Amazon DynamoDB.
  • Querying live Amazon DynamoDB data using SQL-like statements (HiveQL).
  • Joining data stored in Amazon DynamoDB and exporting it or querying against the joined data.
  • Loading Amazon DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input into an Amazon EMR job flow.

Impact on Throughput Provisioning:

Amazon EMR read and write operations on an Amazon DynamoDB table:

  • Count against established provisioned throughput
  • Potentially increasing the frequency of provisioned throughput exceptions
  • For large requests, Amazon EMR implements retries with exponential backoff to manage the request load on the Amazon DynamoDB table
  • Running Amazon EMR jobs concurrently with other traffic may cause you to exceed the allocated provisioned throughput level
  • Monitor the throughput by checking the ThrottledRequests metric in Amazon CloudWatch
  • If the request load is too high, relaunch the job flow
  • Set the read percent and write percent settings to lower values to throttle the Amazon EMR read and write operations (see the sketch after this list)
  • Amazon EMR implements its own logic to try to balance the load on your Amazon DynamoDB table to minimize the possibility of exceeding your provisioned throughput
  • The integration of Amazon DynamoDB with Amazon EMR does not currently support Binary and Binary Set type attributes.
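A minimal sketch of dialing those settings down from the Hive session; dynamodb.throughput.read.percent and dynamodb.throughput.write.percent are the EMR DynamoDB connector properties behind the read and write percent settings, and the values below are illustrative:

SET dynamodb.throughput.read.percent=0.5;
SET dynamodb.throughput.write.percent=0.5;
-- Depending on connector version, dynamodb.retry.duration can also be set to tune
-- how long a Hive command keeps retrying after provisioned throughput exceptions.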

Facts of Hive Commands on DynamoDB:

  • Operations on a Hive table reference data stored in Amazon DynamoDB
  • Hive commands are subject to the Amazon DynamoDB table’s provisioned throughput settings
  • Hive commands DROP TABLE and CREATE TABLE only act on the local tables in Hive and do not create or drop tables in Amazon DynamoDB
  • Amazon EMR returns information about the cluster used to process the query, including how many times your provisioned throughput was exceeded
  • When you map a Hive table to a location in Amazon S3, do not map it to the root path of the bucket, s3://mybucket

Possible Hive Actions on DynamoDB:

Refer http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html

  • Export an Amazon DynamoDB table to an Amazon S3 bucket
  • Export an Amazon DynamoDB table to an Amazon S3 bucket using formatting
  • Export an Amazon DynamoDB table to an Amazon S3 bucket without specifying a column mapping
  • Export an Amazon DynamoDB table to an Amazon S3 bucket using data compression
  • Export an Amazon DynamoDB table to HDFS
  • Read non-printable UTF-8 character data in Hive
  • Import a table from Amazon S3 to Amazon DynamoDB
  • Import a table from an Amazon S3 bucket to Amazon DynamoDB without specifying a column mapping
  • Import a table from HDFS to Amazon DynamoDB
  • Find the largest value for a mapped column
  • Aggregate data using the GROUP BY clause
  • Join two Amazon DynamoDB tables
  • Join two tables from different sources

Optimizations on EMR over DynamoDB:

The following factors influence Hive query performance over DynamoDB tables:

  • Provisioned Read Capacity Units
  • Read Percent Setting
  • Write Percent Setting
  • Retry Duration Setting
  • Number of Map Tasks
  • Parallel Data Requests
  • Process Duration
  • Avoid Exceeding Throughput
  • Request Time
  • Time-Based Tables
  • Archived Data
  • Viewing Hadoop Logs

Hands-on Hive over DynamoDB:

Creating a Hive table:

create database ddb;
CREATE EXTERNAL TABLE GeoIPMappingHive (ip String, isp String,longitude double,latitude double, iso2CountryCode String, city String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "GeoIPMapping", 
"dynamodb.column.mapping" = "ip:ip,isp:isp,longitude:long,latitude:lat,iso2CountryCode:cntrycd,city:city");

Export DynamoDB to S3:

CREATE EXTERNAL TABLE geoip_s3_export(ip String, isp String, longitude double, latitude double, iso2CountryCode String, city String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://ariyabala/POC/DynamoDB/sample';

INSERT OVERWRITE TABLE geoip_s3_export SELECT * FROM GeoIPMappingHive;

Import S3 to DynamoDB:


CREATE EXTERNAL TABLE geoip_s3_import0(rowkey String, ip String, isp String, longitude double, latitude double, iso2CountryCode String, city String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://ariyabala/POC/HBase/output/ip_hist_22_aug/0';

CREATE EXTERNAL TABLE GeoIPMappingHive (ip String, isp String, longitude double, latitude double, iso2CountryCode String, city String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("hive.jdbc.update.on.duplicate" = "true",
"dynamodb.table.name" = "GeoIPMapping",
"dynamodb.column.mapping" = "ip:ip,isp:isp,longitude:long,latitude:lat,iso2CountryCode:cntrycd,city:city");

INSERT INTO TABLE GeoIPMappingHive IF NOT EXISTS
SELECT ip, isp, longitude, latitude, iso2CountryCode, city FROM geoip_s3_import0;

With a read and write capacity of 20,000 units, the statistics of the import are below:

MapReduce Total cumulative CPU time: 21 seconds 780 msec
Ended Job = job_201308282023_0090
Counters:
175428 Rows loaded to geoipmappinghive
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 21.78 sec HDFS Read: 218 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 780 msec
OK
Time taken: 204.207 seconds

Performance on Full Scan:

select count(*) from geoipmappinghive;

With 40000 Throughput:

Ended Job = job_201308282023_0242
Counters:
MapReduce Jobs Launched:
Job 0: Map: 35 Reduce: 1 Cumulative CPU: 2180.53 sec HDFS Read: 11259 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 36 minutes 20 seconds 530 msec
OK
76953278
Time taken: 222.463 seconds, Fetched: 1 row(s)

With 20000 Throughput:

Ended Job = job_201308282023_0243
Counters:
MapReduce Jobs Launched:
Job 0: Map: 35 Reduce: 1 Cumulative CPU: 2186.27 sec HDFS Read: 11259 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 36 minutes 26 seconds 270 msec
OK
76953278
Time taken: 222.183 seconds, Fetched: 1 row(s)

With 10000 Throughput:

MapReduce Total cumulative CPU time: 36 minutes 23 seconds 30 msec
Ended Job = job_201308282023_0244
Counters:
MapReduce Jobs Launched:
Job 0: Map: 35 Reduce: 1 Cumulative CPU: 2183.03 sec HDFS Read: 11259 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 36 minutes 23 seconds 30 msec
OK
76953278
Time taken: 225.322 seconds, Fetched: 1 row(s)

With 5000 Throughput:

Ended Job = job_201308282023_0246
Counters:
MapReduce Jobs Launched:
Job 0: Map: 35 Reduce: 1 Cumulative CPU: 2246.46 sec HDFS Read: 11259 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 37 minutes 26 seconds 460 msec
OK
76953278
Time taken: 340.467 seconds, Fetched: 1 row(s)

With 1000 Throughput:

Ended Job = job_201308282023_0245
Counters:
MapReduce Jobs Launched:
Job 0: Map: 15 Reduce: 1 Cumulative CPU: 2480.51 sec HDFS Read: 4745 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 41 minutes 20 seconds 510 msec
OK
76953278
Time taken: 1521.753 seconds, Fetched: 1 row(s)

With HBase:

Ended Job = job_201308282023_0249
Counters:
MapReduce Jobs Launched:
Job 0: Map: 57 Reduce: 1 Cumulative CPU: 16254.71 sec HDFS Read: 23973 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 30 minutes 54 seconds 710 msec
OK
_c0
76953278
Time taken: 2611.884 seconds, Fetched: 1 row(s)

Performance Comparison:

Query     Technology  Throughput Read Units  Time Taken (sec)  CPU Time Taken (sec)  Table Scan Count  Mappers  Reducers
count(*)  DynamoDB    40000                  222.463           2180.53               76953278          35       1
          DynamoDB    20000                  222.183           2186.27               76953278          35       1
          DynamoDB    10000                  225.322           2183.03               76953278          35       1
          DynamoDB    5000                   340.467           2246.46               76953278          35       1
          DynamoDB    1000                   1521.753          2480.51               76953278          15       1
          HBase       Auto                   2611.884          16254.71              76953278          57       1
          MongoDB     –                      –                 –                     –                 –        –