In MongoDB deployments with mixed versions of :binary:`~bin.mongod`, it is
possible to get an ``Unrecognized pipeline stage name: '$sample'``
error. To mitigate this situation, explicitly configure which partitioner
to use and define the schema when using DataFrames.
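For example, the partitioner can be selected through the read configuration, and supplying a case class defines the schema up front so the connector does not need to sample the collection to infer it. This is a minimal sketch assuming the connector's ``ReadConfig`` API; the connection URI, collection name, and ``Character`` case class are placeholders:

.. code-block:: scala

   import com.mongodb.spark.MongoSpark
   import com.mongodb.spark.config.ReadConfig

   // Placeholder case class; its fields define the DataFrame schema,
   // so no schema-inference sampling is required.
   case class Character(name: String, age: Int)

   // Hypothetical URI and collection. Setting "partitioner" explicitly
   // selects a partitioning strategy that does not rely on $sample.
   val readConfig = ReadConfig(Map(
     "uri" -> "mongodb://localhost/test.myCollection",
     "partitioner" -> "MongoPaginateBySizePartitioner"))

   val df = MongoSpark.load[Character](spark, readConfig)

Any partitioner that avoids the ``$sample`` aggregation stage works here; ``MongoPaginateBySizePartitioner`` is one such choice.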

How do I use MongoDB BSON types that are unsupported in Spark?
--------------------------------------------------------------

Some custom MongoDB BSON types, such as ``ObjectId``, are unsupported
in Spark.

The MongoDB Spark Connector converts custom MongoDB data types to and
from extended JSON-like representations of those data types that are
compatible with Spark. See :ref:`bson-spark-datatypes` for a list of
custom MongoDB types and their Spark counterparts.

Spark Datasets
~~~~~~~~~~~~~~

To create a standard Dataset with custom MongoDB data types, use the
``fieldTypes`` helpers:

.. code-block:: scala

   import org.bson.types.ObjectId
   import com.mongodb.spark.sql.fieldTypes

   case class MyData(id: fieldTypes.ObjectId, a: Int)
   val ds = spark.createDataset(Seq(MyData(fieldTypes.ObjectId(new ObjectId()), 99)))
   ds.show()

The preceding example creates a Dataset containing the following fields
and data types:

- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
  by ``fieldTypes.ObjectId``.

- The ``a`` field is an ``Int``, a data type available in Spark.

Spark DataFrames
~~~~~~~~~~~~~~~~

To create a DataFrame with custom MongoDB data types, you must supply
those types when you create the RDD and schema:

- Create RDDs using custom MongoDB BSON types
  (e.g. ``ObjectId``). The Spark Connector handles converting
  those custom types into Spark-compatible data types.

- Declare schemas using the ``StructFields`` helpers for data types
  that are not natively supported by Spark
  (e.g. ``StructFields.objectId``). Refer to
  :ref:`bson-spark-datatypes` for the mapping between BSON and custom
  MongoDB Spark types.

.. code-block:: scala

   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
   import org.bson.types.ObjectId
   import com.mongodb.spark.sql.helpers.StructFields

   val data = Seq(Row(Row(new ObjectId().toHexString()), 99))
   val rdd = spark.sparkContext.parallelize(data)
   val schema = StructType(List(StructFields.objectId("id", true), StructField("a", IntegerType, true)))
   val df = spark.createDataFrame(rdd, schema)
   df.show()

The preceding example creates a DataFrame containing the following
fields and data types:

- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
  by ``StructFields.objectId``.

- The ``a`` field is an ``Int``, a data type available in Spark.
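A DataFrame built this way can then be written back with the connector's save helper, so that fields declared through the ``StructFields`` helpers round-trip as their native BSON types rather than as strings. This is a sketch assuming a write configuration (URI and collection) is already supplied elsewhere, for example through the ``SparkConf``:

.. code-block:: scala

   import com.mongodb.spark.MongoSpark

   // The id column, declared with StructFields.objectId, is stored as a
   // native BSON ObjectId rather than its hex-string representation.
   MongoSpark.save(df)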