You can do it using just the split and size functions of the PySpark API (below is an example):
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()
sqlContext = SQLContext(my_spark.sparkContext)
sqlContext.createDataFrame([['this is a sample address'], ['another address']]) \
    .select(F.size(F.split(F.col("_1"), " "))).show()
Below is the output:
+------------------+
|size(split(_1, ))|
+------------------+
| 5|
| 2|
+------------------+
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with the necessary configuration
    sc = SparkContext("local", "PySpark Word Count Example")
    # read data from the text file and split each line into words
    words = sc.textFile("D:/workspace/spark/input.txt").flatMap(lambda line: line.split(" "))
    # count the occurrences of each word
    wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    # save the counts to the output directory
    wordCounts.saveAsTextFile("D:/workspace/spark/output/")
Apache Spark is an open-source cluster computing framework for real-time
processing.
YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster.
YARN is a distributed container manager, like Mesos for example, whereas
Spark is a data processing tool.
Spark can run on YARN, the same way Hadoop Map Reduce can run on YARN.
Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
Explain the concept of Resilient Distributed Dataset (RDD).
Resilient Distributed Datasets are the fundamental data structure of Apache
Spark.
It is embedded in Spark Core.
RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are split into partitions and can be executed on different nodes of a cluster.
RDDs are created by either transformation of existing RDDs or by loading an
external dataset from stable storage like HDFS or HBase.
RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed in nature. There are primarily two types of RDD:
Parallelized Collections: Here, the existing RDDs run parallel with one another.
Hadoop Datasets: They perform functions on each file record in HDFS or other storage systems.
RDDs are basically parts of data that are stored in memory distributed across many nodes. RDDs are lazily evaluated in Spark, and this lazy evaluation is part of what contributes to Spark's speed.
The executor memory is basically a measure of how much of the worker node's memory the application will utilize.
Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors.
Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations. Everything in Spark is a partitioned RDD.
An RDD is a distributed collection of objects. Distributed means each RDD is divided into multiple partitions, and each of these partitions can reside in memory or be stored on the disk of different machines in a cluster. RDDs are an immutable (read-only) data structure. You can't change the original RDD, but you can always transform it into a different RDD with all the changes you want.
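A minimal PySpark sketch of that immutability, assuming a SparkContext named sc already exists:
original = sc.parallelize([1, 2, 3, 4, 5])
doubled = original.map(lambda x: x * 2)   # map() returns a brand-new RDD
print(original.collect())                 # [1, 2, 3, 4, 5]  (unchanged)
print(doubled.collect())                  # [2, 4, 6, 8, 10]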
13. How can you calculate the executor memory?
Consider the following cluster information:
Cluster: 10 nodes
Each node: 16 cores (minus 1 for the OS), 61 GB RAM (minus 1 GB for the OS)
Identifying the number of cores:
The number of cores is the number of concurrent tasks an executor can run in parallel; the rule of thumb is 5.
Calculating the number of executors:
Executors per node = number of cores / concurrent tasks = 15 / 5 = 3
Total executors = number of nodes * executors per node = 10 * 3 = 30 executors.
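The same arithmetic as a quick Python sketch; the per-executor memory line is an extra illustrative step (usable RAM per node divided by executors per node), not part of the notes above:
nodes = 10
cores_per_node = 16 - 1                  # 1 core reserved for the OS
ram_per_node_gb = 61 - 1                 # 1 GB reserved for the OS
tasks_per_executor = 5                   # rule of thumb
executors_per_node = cores_per_node // tasks_per_executor        # 3
total_executors = nodes * executors_per_node                     # 30
memory_per_executor_gb = ram_per_node_gb // executors_per_node   # 20 GB, before any overhead
print(total_executors, memory_per_executor_gb)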
Top 40 Apache Spark Interview Questions and Answers
By Shivam Arora
Last updated on Jul 1, 2021
Apache Spark is a unified analytics engine for processing large volumes of data.
It can run workloads 100 times faster and offers over 80 high-level operators
that make it easy to build parallel apps. Spark can run on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud, and can access data from multiple
sources.
This article covers the most important Apache Spark interview questions that you might face in a Spark interview. The questions have been segregated into different sections based on the various components of Apache Spark, and after going through this article you should be able to answer most of the questions asked in your next Spark interview.
Apache Spark Interview Questions
The Apache Spark interview questions have been divided into two parts:
Apache Spark Interview Questions for Beginners
Apache Spark Interview Questions for Experienced
Let us begin with a few basic Apache Spark interview questions!
Apache Spark Interview Questions for Beginners
1. How is Apache Spark different from MapReduce?
Apache Spark vs. MapReduce:
Spark processes data in batches as well as in real time; MapReduce processes data in batches only.
Spark runs almost 100 times faster than Hadoop MapReduce; Hadoop MapReduce is slower when it comes to large-scale data processing.
Spark stores data in RAM (in-memory), so it is easier to retrieve; Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve.
Spark provides caching and in-memory data storage; Hadoop is highly disk-dependent.
2. What are the important components of the Spark ecosystem?
Apache Spark has 3 main categories that comprise its ecosystem. Those are:
Language support: Spark integrates with different languages for building applications and performing analytics. These languages are Java, Python, Scala, and R.
Core Components: Spark supports 5 main core components. These are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
Cluster Management: Spark can be run in 3 environments. Those are the Standalone
cluster, Apache Mesos, and YARN.
3. Explain how Spark runs applications with the help of its architecture.
This is one of the most frequently asked spark interview questions, and the
interviewer will expect you to give a thorough answer to it.
Spark applications run as independent processes that are coordinated by the
SparkSession object in the driver program. The resource manager or cluster
manager assigns tasks to the worker nodes with one task per partition. Iterative
algorithms apply operations repeatedly to the data so they can benefit from
caching datasets across iterations. A task applies its unit of work to the
dataset in its partition and outputs a new partition dataset. Finally, the
results are sent back to the driver application or can be saved to the disk.
4. What are the different cluster managers available in Apache Spark?
Standalone Mode: By default, applications submitted to the standalone mode
cluster will run in FIFO order, and each application will try to use all
available nodes. You can launch a standalone cluster either manually, by
starting a master and workers by hand, or use our provided launch scripts. It is
also possible to run these daemons on a single machine for testing.
Apache Mesos: Apache Mesos is an open-source project to manage computer
clusters, and can also run Hadoop applications. The advantages of deploying
Spark with Mesos include dynamic partitioning between Spark and other frameworks
as well as scalable partitioning between multiple instances of Spark.
Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can
be run on YARN as well.
Kubernetes: Kubernetes is an open-source system for automating deployment,
scaling, and management of containerized applications.
5. What is the significance of Resilient Distributed Datasets in Spark?
Resilient Distributed Datasets are the fundamental data structure of Apache
Spark. It is embedded in Spark Core. RDDs are immutable, fault-tolerant,
distributed collections of objects that can be operated on in parallel. RDDs are
split into partitions and can be executed on different nodes of a cluster.
RDDs are created by either transformation of existing RDDs or by loading an
external dataset from stable storage like HDFS or HBase.
6. What is a lazy evaluation in Spark?
When Spark operates on any dataset, it remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action; this behavior, known as lazy evaluation, helps optimize the overall data processing workflow.
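A minimal PySpark sketch of lazy evaluation, assuming a SparkContext named sc:
rdd = sc.parallelize(range(1000000))
mapped = rdd.map(lambda x: x * 2)   # transformation only: nothing is computed yet
print(mapped.take(5))               # action: the computation runs now -> [0, 2, 4, 6, 8]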
7. What makes Spark good at low latency workloads like graph processing and
Machine Learning?
Apache Spark stores data in-memory for faster processing and building machine
learning models. Machine Learning algorithms require multiple iterations and
different conceptual steps to create an optimal model. Graph algorithms traverse
through all the nodes and edges to generate a graph. For these low-latency workloads that need multiple iterations over the same data, keeping it in memory leads to increased performance.
8. How can you trigger automatic clean-ups in Spark to handle accumulated
metadata?
To trigger the clean-ups, you need to set the parameter spark.cleaner.ttl.
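A minimal sketch of setting this parameter, assuming an older Spark version where spark.cleaner.ttl is still supported (the value is in seconds):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cleanup-demo").set("spark.cleaner.ttl", "3600")
sc = SparkContext(conf=conf)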
9. How can you connect Spark to Apache Mesos?
There are a total of 4 steps that can help you connect Spark to Apache Mesos.
Configure the Spark Driver program to connect with Apache Mesos
Put the Spark binary package in a location accessible by Mesos
Install Spark in the same location as Apache Mesos
Configure the spark.mesos.executor.home property to point to the location where Spark is installed
10. What is a Parquet file and what are its advantages?
Parquet is a columnar format that is supported by several data processing
systems. With the Parquet file, Spark can perform both read and write
operations.
Some of the advantages of having a Parquet file are:
It enables you to fetch specific columns for access.
It consumes less space
It follows the type-specific encoding
It limits I/O operations
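A minimal PySpark sketch of reading and writing Parquet, assuming a SparkSession named spark; the path is hypothetical:
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.mode("overwrite").parquet("/tmp/people.parquet")
names = spark.read.parquet("/tmp/people.parquet").select("name")   # fetch only the column you need
names.show()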
11. What is shuffling in Spark? When does it occur?
Shuffling is the process of redistributing data across partitions that may lead
to data movement across the executors. The shuffle operation is implemented
differently in Spark compared to Hadoop.
Shuffling has 2 important compression parameters:
spark.shuffle.compress – checks whether the engine will compress shuffle outputs or not
spark.shuffle.spill.compress – decides whether to compress intermediate shuffle spill files or not
It occurs while joining two tables or while performing byKey operations such as
GroupByKey or ReduceByKey
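For example, a byKey operation such as reduceByKey triggers a shuffle; a minimal PySpark sketch, assuming a SparkContext named sc:
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 4)
counts = pairs.reduceByKey(lambda x, y: x + y)   # records with the same key are moved to the same partition
print(counts.collect())                          # e.g. [('a', 2), ('b', 1)]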
12. What is the use of coalesce in Spark?
Spark uses a coalesce method to reduce the number of partitions in a DataFrame.
Suppose you want to read data from a CSV file into an RDD having four
partitions.
This is how a filter operation is performed to remove all the multiples of 10 from the data.
The RDD has some empty partitions. It makes sense to reduce the number of
partitions, which can be achieved by using coalesce.
This is how the resultant RDD would look after applying coalesce.
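A minimal PySpark sketch of the same idea, assuming a SparkContext named sc (parallelize is used instead of the CSV file to keep it self-contained):
rdd = sc.parallelize(range(100), 4)            # 4 partitions
filtered = rdd.filter(lambda x: x % 10 != 0)   # remove all multiples of 10
smaller = filtered.coalesce(2)                 # merge into 2 partitions without a full shuffle
print(smaller.getNumPartitions())              # 2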
13. How can you calculate the executor memory?
Consider the following cluster information:
10 nodes; each node has 16 cores (minus 1 for the OS) and 61 GB of RAM (minus 1 GB for the OS).
Identifying the number of cores: the number of cores is the number of concurrent tasks an executor can run in parallel; the rule of thumb is 5.
Calculating the number of executors: executors per node = number of cores / concurrent tasks = 15 / 5 = 3, so total executors = number of nodes * executors per node = 10 * 3 = 30.
14. What are the various functionalities supported by Spark Core?
Spark Core is the engine for parallel and distributed processing of large data
sets. The various functionalities supported by Spark Core include:
Scheduling and monitoring jobs
Memory management
Fault recovery
Task dispatching
15. How do you convert a Spark RDD into a DataFrame?
There are 2 ways to convert a Spark RDD into a DataFrame:
Using the helper function - toDF
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()
Using SparkSession.createDataFrame
You can convert an RDD[Row] to a DataFrame by
calling createDataFrame on a SparkSession object
def createDataFrame(RDD, schema:StructType)
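For comparison, a minimal PySpark sketch of both approaches, assuming a SparkSession named spark and its SparkContext sc:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

rdd = sc.parallelize([Row("1", "Peter"), Row("2", "Ana")])
df1 = rdd.toDF(["_id", "first_name"])           # helper function on an RDD of Rows

schema = StructType([StructField("_id", StringType(), True),
                     StructField("first_name", StringType(), True)])
df2 = spark.createDataFrame(rdd, schema)        # explicit SparkSession.createDataFrame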
16. Explain the types of operations supported by RDDs.
RDDs support 2 types of operation:
Transformations: Transformations are operations that are performed on an RDD to
create a new RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation
on an RDD (Example: reduce, first, count)
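A minimal PySpark sketch showing one of each, assuming a SparkContext named sc:
nums = sc.parallelize([1, 2, 3, 4])
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: builds a new RDD, nothing runs yet
print(evens.count())                        # action: triggers the computation and returns 2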
17. What is a Lineage Graph?
This is another frequently asked spark interview question. A Lineage Graph is a
dependencies graph between the existing RDD and the new RDD. It means that all
the dependencies between the RDD will be recorded in a graph, rather than the
original data.
An RDD lineage graph is needed when we want to compute a new RDD or when we want to recover lost data from a lost persisted RDD. Spark does not support data replication in memory, so if any data is lost, it can be rebuilt using the RDD lineage. It is also called an RDD operator graph or RDD dependency graph.
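A minimal PySpark sketch for inspecting a lineage, assuming a SparkContext named sc:
rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.toDebugString())   # prints the chain of parent RDDs, i.e. the lineage / dependency graph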
18. What do you understand about DStreams in Spark?
A Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming.
It represents a continuous stream of data that is either an input stream from a source or the processed data stream generated by transforming the input stream.
19. Explain Caching in Spark Streaming.
Caching also known as Persistence is an optimization technique for Spark
computations. Similar to RDDs, DStreams also allow developers to persist the
stream’s data in memory. That is, using the persist() method on a DStream will
automatically persist every RDD of that DStream in memory. It helps to save
interim partial results so they can be reused in subsequent stages.
For input streams that receive data over the network, the default persistence level is set to replicate the data to two nodes for fault tolerance.
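A minimal PySpark Streaming sketch of persisting a DStream, assuming an existing SparkContext named sc; the host and port are hypothetical:
from pyspark import StorageLevel
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                 # 10-second batches
lines = ssc.socketTextStream("localhost", 9999)
lines.persist(StorageLevel.MEMORY_ONLY)        # every RDD of this DStream will be kept in memory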
20. What is the need for broadcast variables in Spark?
Broadcast variables allow the programmer to keep a read-only variable cached on
each machine rather than shipping a copy of it with tasks. They can be used to
give every node a copy of a large input dataset in an efficient manner. Spark
distributes broadcast variables using efficient broadcast algorithms to reduce
communication costs.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Apache Spark Interview Questions for Experienced
21. How to programmatically specify a schema for DataFrame?
DataFrame can be created programmatically with three steps:
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of Rows in
the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by
SparkSession.
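A minimal PySpark sketch of these three steps, assuming a SparkSession named spark and its SparkContext sc:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. create an RDD of Rows from the original RDD
row_rdd = sc.parallelize([("Peter", 30), ("Ana", 25)]).map(lambda t: Row(t[0], t[1]))
# 2. create the schema matching the structure of the Rows
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
# 3. apply the schema via createDataFrame
df = spark.createDataFrame(row_rdd, schema)
df.show()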
22. Which transformation returns a new DStream by selecting only those records
of the source DStream for which the function returns true?
1. map(func)
2. transform(func)
3. filter(func)
4. count()
The correct answer is option 3, filter(func).
23. Does Apache Spark provide checkpoints?
This is one of the most frequently asked spark interview questions where the
interviewer expects a detailed answer (and not just a yes or no!). Give as
detailed an answer as possible here.
Yes, Apache Spark provides an API for adding and managing checkpoints.
Checkpointing is the process of making streaming applications resilient to
failures. It allows you to save the data and metadata into a checkpointing
directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving
the metadata to fault-tolerant storage like HDFS. Metadata includes
configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage; this is needed in some stateful transformations where the upcoming RDD depends on the RDDs of previous batches.
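A minimal PySpark sketch of both kinds of checkpointing; the paths are hypothetical, and a SparkContext sc and StreamingContext ssc are assumed to exist:
# streaming: enables metadata (and, for stateful operations, data) checkpointing
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

# batch RDDs: set a checkpoint directory, then checkpoint an RDD explicitly
sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")
rdd = sc.parallelize(range(100)).map(lambda x: x * x)
rdd.checkpoint()   # saved to the checkpoint directory when the next action runs
rdd.count()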
24. What do you mean by sliding window operation?
The sliding window controls the transmission of data packets between multiple computer networks. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data.
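A minimal PySpark Streaming sketch of a windowed computation, assuming a StreamingContext named ssc with a 10-second batch interval; the host and port are hypothetical:
pairs = ssc.socketTextStream("localhost", 9999) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1))
# word counts over the last 30 seconds of data, sliding every 10 seconds
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10)
windowed.pprint()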
25. What are the different levels of persistence in Spark?
DISK_ONLY - Stores the RDD partitions only on the disk
MEMORY_ONLY_SER - Stores the RDD as serialized Java objects (one byte array per partition)
MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD
is not able to fit in the memory available, some partitions won’t be cached
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case
the RDD is not able to fit in the memory, additional partitions are stored on
the disk
MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing
partitions not able to fit in the memory to the disk
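A minimal PySpark sketch of choosing one of these levels, assuming a SparkContext named sc:
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # partitions that do not fit in memory are spilled to disk
rdd.count()                                 # materializes (and caches) the RDD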
26. What is the difference between map and flatMap transformation in Spark
Streaming?
map()
A map function returns a new DStream by passing each element of the source DStream through a function func. The map function takes one element as input, processes it according to custom code (specified by the developer), and returns exactly one element at a time.
flatMap()
It is similar to the map function: it applies to each element of the RDD/DStream and returns the result as a new RDD/DStream. flatMap allows returning 0, 1, or more elements for each input element; in the flatMap operation, the returned collections are flattened into a single stream of elements.
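A minimal PySpark sketch of the difference, assuming a SparkContext named sc:
lines = sc.parallelize(["hello world", "spark streaming"])
print(lines.map(lambda l: l.split(" ")).collect())      # [['hello', 'world'], ['spark', 'streaming']]
print(lines.flatMap(lambda l: l.split(" ")).collect())  # ['hello', 'world', 'spark', 'streaming']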
27. How would you compute the total count of unique words in Spark?
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
2. Define a function that breaks each line into words:
def toWords(line):
    return line.split()
3. Run the toWords function on each element of the RDD in Spark as a flatMap transformation:
words = lines.flatMap(toWords)
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1)
wordsTuple = words.map(toTuple)
5. Perform the reduceByKey() operation:
def add(x, y):
    return x + y
counts = wordsTuple.reduceByKey(add)
6. Print (counts.count() gives the total number of unique words):
counts.collect()
=========================================================
1)
paired RDD
2)
coalesce and repartition
repartition can increase or decrease the number of partitions and does a full shuffle.
coalesce avoids a full shuffle and is better for reducing the number of partitions.
Spark works best when the partition sizes are even; coalesce might give unevenly sized partitions, and that would impact the performance of the job.
3)
Different levels of persistence
memory only
disk only
memory and disk
4)
difference between static and dynamic partitions in Hive
5)
Parquet -
columnar store
good for compression
analytical functions
6)
Bucketing with dynamic partitioning - is it possible?
7)
second highest salary
8)
Can we use Hive as the metadata DB instead of MySQL?
9)
what is Sqoop?
10)
explain the MapReduce flow
mapper => key-value pair => shuffle => reducer
Assume 10 GB of data. How many mappers and reducers will be created?
Number of mappers = input size / block size = 10 GB / 128 MB = 80 mappers.
By default there is 1 reducer; we can specify the number of reducers, e.g. 2 or 3.
How does data get transferred from mapper to reducer?
The shuffle phase handles it automatically.
If you have more than 1 reducer, how does data get into each reducer?
In the shuffle phase, the key-value pairs go through a partitioner;
the partitioner decides where each KV pair should go. The partitioner is a hash function:
for each key it generates a hash, which maps to a fixed reducer.
11)
when to use Hive and when to use Spark
Hive is a client tool - any query will be executed as MapReduce on the Hadoop cluster;
the query gets translated into MR.
Spark is an execution framework: it creates a lineage, forms logical and physical plans,
and executes them.
12)
does Hive support CRUD and ACID?
CRUD (update, delete) and ACID (transactions) are experimental features that are only
available for the ORC (Optimized Row Columnar) file format,
available from Hive 0.13 onwards.
13)
what is the difference between RDD and DataFrame
In Spark, when we write instructions, those instructions are transformed into RDDs,
and Spark will form the associations between RDDs based on the instructions;
those instructions are then converted into logical and physical plans.
DataFrame is an improvement over RDD.
A DataFrame gives the facility to view the RDD with a schema,
as if we are looking at it as a table,
plus pluggable memory management (like Project Tungsten)
and a better optimizer (the Catalyst optimizer).
14)
Oozie
a scheduler tool
it helps to form workflows
and helps to maintain dependencies.
15)
Flume and Kafka
are data ingestion tools.
Kafka is a cluster in itself. It is used for streaming.
Kafka is distributed: it can get streams from multiple producers and give data to multiple consumers.
Data can be pushed to HDFS.
16)
Sentry
It's an authorization tool.
17)
https://www.youtube.com/watch?v=haPwh4m_jq0
Spark is a memory-intensive distributed system with access to several GB of memory.
Yet jobs can still fail with out-of-memory (OOM) issues.
The driver and executors are JVM processes; they have a fixed amount of memory allocated.
When a Spark job fails, identify what is failing.
A Spark job is made of a driver and a set of executors. They both use a large amount of memory,
and they are both allocated a certain finite amount of memory.
When they need more than what is allocated, we get an OOM.
Check whether the driver failed, an executor failed, or a task inside an executor failed.
A Spark job consists of a number of tasks.
Tasks run inside executor containers;
several tasks can run inside a container (executor).
Executors are the containers with CPU and memory allocated,
and multiple executors (containers) can exist for a job.
Errors like OOM or "container killed" will show up when a memory issue happens.
The driver orchestrates the execution across the executors.
Both memories can be controlled by:
spark.executor.memory
spark.driver.memory
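A minimal sketch of setting these two parameters (the values are illustrative; in practice spark.driver.memory usually has to be set at submit time, e.g. spark-submit --driver-memory 4g):
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.driver.memory", "4g"))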
If 4 GB is allocated to an executor:
300 MB is reserved.
60% of (4 GB - 300 MB) will go to Spark = Spark memory;
the rest goes to user memory.
Out of the 60% of (4 GB - 300 MB) (Spark memory),
some is used for storage (cached objects) and the rest for execution.
Spark memory is dynamically managed (available since Spark 1.6).
Initially, storage and execution each get 50% of the Spark memory (60% of 4 GB - 300 MB):
4 GB = 300 MB reserved
+ ~1.5 GB user memory (40% of 4 GB - 300 MB)
+ ~2.2 GB Spark memory (60% of 4 GB - 300 MB)
  = ~1.1 GB storage + ~1.1 GB execution
When memory for execution is not enough, Spark will take memory from storage;
if storage holds objects, they will be moved to disk.
Storage memory inside Spark memory is managed by spark.memory.storageFraction,
which is 50% by default.
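The same breakdown as a quick Python sketch; the 60% and 50% splits correspond to the spark.memory.fraction and spark.memory.storageFraction defaults assumed in the notes above:
total_mb = 4 * 1024
reserved_mb = 300
usable_mb = total_mb - reserved_mb
spark_memory_mb = 0.6 * usable_mb              # ~2.2 GB
user_memory_mb = 0.4 * usable_mb               # ~1.5 GB
storage_mb = 0.5 * spark_memory_mb             # ~1.1 GB
execution_mb = spark_memory_mb - storage_mb    # ~1.1 GB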
How many cores are allocated to the executor?
The number of cores means the number of tasks we can run in parallel.
If we have 2 CPU cores, then we can run 2 tasks in parallel.
# CPU cores = # tasks in parallel
Rule of thumb: 2 or 3 tasks for each CPU core.
The number of partitions decides the number of tasks,
so if we have a big dataset with 100 partitions,
then Spark will create 100 tasks.
If each task has to deal with a lot of data, there is a potential problem of getting an OOM;
too few partitions mean each task is overloaded.
Q)
How to optimize Spark
Choose the file format - JSON and XML are slow, with no optimization.
PARQUET or RC will give better performance.
Partitioning technique:
import org.apache.spark.HashPartitioner
val rdd1 = sc.parallelize(Array(("a", 3), ("a", 1), ("b", 7), ("a", 5), ("c", 5)), 3)
  .partitionBy(new HashPartitioner(2))
val rdd2 = sc.parallelize(Array(("c", 5), ("d", 1), ("b", 5), ("d", 5)), 2)
  .partitionBy(new HashPartitioner(2))
Broadcast :- broadcast certain data to all executors
Kryo serializer :-
Spark serializes data for all operations;
the default Java serializer is bulky
val jrdd = rdd1.join(rdd2)
jrdd.collect().foreach(println)
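And a minimal PySpark sketch of switching on the Kryo serializer mentioned above:
from pyspark import SparkConf

conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")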
Q)
The convergence of SQL and NoSQL
Both SQL and NoSQL databases have their pros and cons. As such, there has been a
movement to take the best characteristics of both types of databases and
integrate them so users can realize the best of both worlds.
For instance, MySQL, the most popular open-source relational database, offers
MySQL Document Store. This provides the structure of a MySQL database combined
with the flexibility and high availability of NoSQL without having to implement
a separate NoSQL database.
MongoDB, one of the most popular NoSQL databases, offers multi-document ACID
transactions.
AWS’ managed NoSQL database, DynamoDB, also provides ACID-compliant transaction
functionality.
And with the easy database setup that cloud service providers offer, you have
the ability to use both SQL and NoSQL databases in your cloud data architecture
to meet your data storage needs.
Now you have much more flexibility regardless of whether you choose a SQL or
NoSQL database, and there are sure to be more flexible options in the future.
Database options
Regardless of whether you go with a SQL or NoSQL database (or both!), there are
plenty of options to choose from.
On-premise SQL database offerings include:
MySQL – as mentioned prior, the most popular open-source relational database
Microsoft SQL server – Microsoft’s enterprise version of SQL
PostgreSQL – an enterprise-level, open-source database focused on extensibility
Oracle – full-service (and expensive) SQL option
MariaDB – an enhanced version of MySQL, built by MySQL’s original developers
And many more
The major cloud service platforms have their own SQL options:
AWS has:
RDS, their standard cloud SQL database
Aurora, which focuses on increased throughput and scalability
Microsoft Azure has:
Azure SQL Database, their managed database-as-a-service
Azure Database for MySQL, PostgreSQL, and MariaDB
Google Cloud Platform (GCP) has:
Cloud SQL, which you can use for MySQL and PostgreSQL
Cloud Spanner, which combines elements of SQL and NoSQL
On-premise NoSQL database options include:
MongoDB – by far the most popular NoSQL database
Redis – an open source, distributed, in-memory key-value database that is super
fast
Cassandra – free, open-source NoSQL database created by Facebook that focuses on
scalability and high availability
Many others
Cloud service providers offer plenty of NoSQL options as well:
AWS has:
DynamoDB, its managed NoSQL database
DocumentDB, a fast, scalable, highly-available MongoDB-compatible database
Microsoft Azure offers:
CosmosDB, its globally distributed, multi-model database
Google Cloud has:
Bigtable, its NoSQL wide-column database service
Cloud Datastore, its NoSQL document database service
Cloud Firestore, a cloud-native NoSQL document database that helps store and
query app data
There is no shortage of database options to choose from!
Q)
Nodes: 54
vCores: 32
Memory: 32
CDH 5.16
Impala 2.12
Impala number of daemons = 250
Overall memory: 1 TB